Rewritten in an attempt to present all the initial revision's valuable
information in a more accessible form, and then some.
Markus Armbruster 2005-06-23 19:08:19 +00:00
parent 39b3493851
commit 5a25fd93c5


@@ -1,134 +1,163 @@
Empire has been extended to optionally work with Unicode. This file
documents the design and implementation of this change.
Unicode changes:
1. login utf-8
Added login options. The first option is utf-8; it sets the
player's PF_UTF8 flag. Default is off.
Syntax
options utf-8 -- turns on the utf-8
options utf-8=1 -- turns on the utf-8
options utf-8=0 -- turns off the utf-8
options -- lists current options and their values
2. flash and wall
a. Message as command argument
Interpret raw command line as message text rather than normal
text.
b. Multi-line mode
Read message lines as message text rather than normal text.
c. Break long lines
Count the characters using UTF-8 format. This works for both ASCII
and UTF-8 formatted strings.
d. Print lines
Print as message text rather than normal text.
3. Telexes and telex-like things
a. read and wire, MOTD and gamedown message
Print as message text rather than normal text.
b. tele, anno, pray, turn.
Read as message text rather than normal text.
4. Input filtering
a. Parsing commands (normal text)
Ignore control and non-ASCII characters when copying argument
strings.
b. Reading normal text command arguments
Replace control and non-ASCII characters, except for tab, with
'?'.
c. Reading message text command arguments
Support message text arguments, used by 3a. and 2b. Replace
control and, if NF_UTF8 is off, non-ASCII characters.
5. Output filtering
Output filtering assumes that there are no control characters or
invalid characters in the output messages: such characters are
either removed during input filtering, or never generated by the
server in the first place.
a. Printing normal text
When NF_UTF8 is on, highlighted text is printed using SO/SI.
b. Printing message text
When NF_UTF8 is off, replace UTF8 characters with '?'.
Definitions:
1. Normal Text
For normal text, the following ASCII characters are valid:
CR, LF and 0x20-0x7e. Normally, LF is a termination action
event. Normally, CR is not used except by the server.
Normal Text does not support UTF8 characters. In normal
text, the 8th bit is used as a highlight bit. If the client
has the utf8 nation flag set, the standout bit is removed
and the highlight block is prefixed with SO (ASCII Shift Out)
and suffixed with SI (ASCII Shift In).
2. Message Text
For message text, the following ASCII characters are valid:
Tab, CR, LF and 0x20-0x7e. Normally, LF is a termination
action event. Normally, CR is not used except by the server.
Message text also supports UTF8 characters if the utf8 nation
flag is turned on; otherwise only the ASCII characters are
supported.
Notes:
1. Strings that are considered message text are commented.
2. Both Normal and Message text are char strings in the server.
Care needs to be taken as some compilers consider char
signed and others default to unsigned char.
3. Unicode functions are prefixed with u.
Notes for Client Implementors:
ASCII Mode
1. If you do not specify a login option, the server will start the
session in ASCII mode.
2. This is close to the previous mode (<4.2.21) but there is more filtering
to remove non-ASCII characters and ASCII control characters.
3. If another client in UTF8 mode tries to send to this client then the
server will replace the non-ASCII characters with question marks.
4. The standout works the same as before where the 8th bit indicates that
the character should be highlighted.
UTF8 Mode
1. The login options must be specified before the play command is sent.
The syntax is 'options utf-8'.
2. The server will filter ASCII control characters but will pass any characters
with the 8th bit set.
3. For the standout mode, the server inserts an ASCII SO character at the
beginning of standout sequence and the server sends an ASCII SI character at
the end of the standout sequence.

Traditional Empire Character Set
--------------------------------
Empire has always used plain ASCII. It abused the most significant
bit for highlighting. Some commands cleared this bit from some input,
others didn't.
The restriction to the archaic ASCII character set bothered some
players. It is barely serviceable for most western languages other
than English, and useless for everything else. This is unbecoming for
a game played around the world.
What is Unicode?
----------------
Unicode is the emerging standard character set for all multi-lingual
applications. The core of Unicode is identical to the Universal
Character Set (UCS) defined in ISO 10646.
UCS is 31-bit. The most commonly used characters are the range
0-0xFFFD, the so called Basic Multilingual Plane (BMP).
A character set can be encoded in different ways. Popular encodings
are UCS-4 (four byte wide characters), UCS-2 (two byte wide
characters; can't represent characters outside the BMP directly), and
UTF-8 (a multibyte encoding).
UTF-8 has a few desirable properties. In particular, it is a
compatible extension of plain (7-bit) ASCII: every ASCII string is
also a valid UTF-8 string, and a plain ASCII byte (an octet with the
most significant bit clear) in a UTF-8 string always encodes the
ASCII character, i.e. it is never part of a multibyte sequence.
To learn more, see the Unicode FAQ, currently at
http://www.cl.cam.ac.uk/~mgk25/unicode.html
Requirements for Unicode Support in Empire
------------------------------------------
* Full backward compatibility to existing clients
* Easy to support for clients
* Minimal impact on server code; no additional portability headaches
* Interoperability between old and new clients and servers
Principles of Design
--------------------
Client/server communication uses what we call external encoding:
either traditional Empire ASCII (7-bit ASCII plus highlighting bit) or
UTF-8. The choice between the encodings is under the control of the
client, and defaults to Empire ASCII. The chosen encoding is a
property of the session; it doesn't carry over to future sessions.
Highlighting is only supported for output (from server to client).
Highlighting in UTF-8 is done with control characters: ASCII SO (Shift
Out, C-n, decimal 14) starts highlighting, and ASCII SI (Shift In,
C-o, decimal 15) stops highlighting. Text encoded in the client's
external encoding is called user text.
There are two internal encodings. We use UTF-8 for player-player
communication, and Empire ASCII for everything else. Most of the
time, there's no difference, because ASCII is valid UTF-8. The
exception is where the highlighting bit can be used. We call such
text normal text.
Input from the client needs to be translated from the client's
external encoding into internal encoding. We call this input
filtering. Since highlighting is not supported on input, the result
is always valid UTF-8. Commands retrieve input that is player-player
communication directly as UTF-8. Other input is retrieved as ASCII,
which replaces non-ASCII characters by '?'[1].
Input filtering from UTF-8 drops ASCII control characters except
'\t' and '\n'.
Input filtering from ASCII additionally replaces non-ASCII characters
by '?'. The result is plain ASCII, which is also valid UTF-8.
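The two input filtering rules above can be sketched roughly as follows.
This is an illustration only, not the code in src/lib/subs/pr.c: the
name filter_input() and its session_is_utf8 parameter are hypothetical,
DEL (0x7f) is assumed to count as a control character, non-ASCII input
in an ASCII session is assumed to be replaced byte by byte, and no
attempt is made to validate UTF-8 sequences.

```c
#include <stddef.h>

/*
 * Filter one line of client input in place (hypothetical helper).
 * Drops ASCII control characters except '\t' and '\n'.  If the
 * session is not UTF-8, replaces every non-ASCII byte by '?'.
 * Returns the length of the filtered string.
 */
size_t
filter_input(char *buf, int session_is_utf8)
{
    unsigned char *src = (unsigned char *)buf;
    unsigned char *dst = (unsigned char *)buf;

    for (; *src; src++) {
        unsigned char c = *src;

        if (c >= 0x80) {
            /* non-ASCII byte: keep in UTF-8 sessions, else replace */
            *dst++ = session_is_utf8 ? c : '?';
        } else if ((c < 0x20 && c != '\t' && c != '\n') || c == 0x7f) {
            continue;           /* drop control character */
        } else {
            *dst++ = c;         /* printable ASCII, tab or newline */
        }
    }
    *dst = 0;
    return (size_t)(dst - (unsigned char *)buf);
}
```

Because filtering only ever drops or replaces bytes, the result never
grows, which is why this sketch can work in place.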
Output to the client needs to be translated from internal encoding to
the client's external encoding. We call this output filtering. It is
integrated into the printing functions, i.e. the functions for sending
output to the client. Most of them accept normal text. Some accept
UTF-8, and some only plain ASCII; all of these are clearly documented.
Output filtering to ASCII doesn't change normal text. In UTF-8 text,
it replaces non-ASCII characters by '?'.
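A minimal sketch of output filtering from UTF-8 text to an ASCII
session might look like this. The function name utf8_to_ascii() is
hypothetical, the input is assumed to be valid UTF-8, and emitting one
'?' per character (rather than one per byte) is an assumption; the
text above does not say which the server does.

```c
#include <stddef.h>

/*
 * Copy UTF-8 string src to dst, replacing each non-ASCII character
 * by a single '?' (hypothetical helper).
 * dst must have room for strlen(src) + 1 bytes.
 * Returns the length of the result.
 */
size_t
utf8_to_ascii(char *dst, const char *src)
{
    char *d = dst;

    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;

        if (c < 0x80)
            *d++ = (char)c;     /* plain ASCII passes through */
        else if ((c & 0xC0) != 0x80)
            *d++ = '?';         /* lead byte of a multibyte sequence */
        /* else: continuation byte, covered by the '?' already emitted */
    }
    *d = 0;
    return (size_t)(d - dst);
}
```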
Output filtering to UTF-8 doesn't change UTF-8 text. In normal text,
it strips highlighting bits and inserts SO/SI control characters in
their place.
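The translation of normal text for a UTF-8 session could be sketched
as below. The function name normal_to_utf8() is hypothetical and the
code is not taken from the server; it only illustrates the rule stated
above: each run of characters with the highlighting bit set is
bracketed by SO (decimal 14) and SI (decimal 15), and the bit itself
is cleared.

```c
#include <stddef.h>

#define SO 14                   /* Shift Out: start highlighting */
#define SI 15                   /* Shift In: stop highlighting */

/*
 * Copy normal text src to dst for a UTF-8 session (hypothetical
 * helper).  Highlighting bits are stripped; highlighted runs are
 * enclosed in SO ... SI.
 * dst must have room for 3 * strlen(src) + 1 bytes.
 * Returns the length of the result.
 */
size_t
normal_to_utf8(char *dst, const char *src)
{
    char *d = dst;
    int highlighted = 0;

    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;
        int hi = (c & 0x80) != 0;

        if (hi != highlighted) {
            *d++ = hi ? SO : SI;        /* highlighting changes here */
            highlighted = hi;
        }
        *d++ = (char)(c & 0x7f);
    }
    if (highlighted)
        *d++ = SI;              /* close a run that reaches the end */
    *d = 0;
    return (size_t)(d - dst);
}
```

The worst case for the output size is a string that alternates
between highlighted and plain characters, where every highlighted
character costs three bytes.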
Notes for Clients
-----------------
Clients use session option utf-8 during login to switch the session to
UTF-8. Highlighting is done differently in UTF-8 sessions. Consult
doc/clients-howto for details.
An ASCII session should work just like previous server versions,
except for the treatment of control and non-ASCII characters. We
believe the new behavior makes more sense.
How to program your host to let you use Unicode in your client's user
interface is platform dependent, and beyond the scope of this
document.
Wolfpack's empclient supports UTF-8 if it runs in a terminal that
understands UTF-8. See its manual page.
Implementation Notes
--------------------
A session uses UTF-8 rather than Empire ASCII if PF_UTF8 is set in
member flags of struct player. Session option utf-8 manipulates this
flag.
Input and output filtering code is in src/lib/subs/pr.c.
Almost all code is untouched, almost all strings are still normal
text. Use of the other encodings is commented (well, we tried!).
The following commands and features have been changed to cope with
Unicode:
* telegram, announce, pray and turn
Get the text of the message as UTF-8. Actually, the commands
themselves weren't affected, only the common code to get a
telex-like message, getele().
* read, wire and the automatic display of MOTD and game-down messages
Display messages as UTF-8. The changes were limited to common
message display code. Its entry point was renamed from prnf() to
uprnf(), to make its unusual UTF-8 argument a bit more obvious.
* flash and wall
Retrieve the text to send as UTF-8, and send it to the recipient(s)
with output filtering appropriate for their session applied.
The text to send can be given on the command line itself. In this
case it has to be fetched from the raw command line buffer. Since
that buffer is already UTF-8, no change was required.
The commands can also prompt for the text to send. A new function
ugetstring() patterned after existing getstring() takes care of
that.
Output filtering is handled by pr_flash(). However, flash and wall
break long lines, and that required some changes for UTF-8.
Breaking long lines there is probably a bad idea.
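To illustrate why breaking long lines needed changes for UTF-8: a
line has to be split by characters, not bytes, so that a multibyte
sequence is never cut in half. The sketch below shows one way to find
a safe split point. The function name utf8_prefix_bytes() is
hypothetical, the input is assumed to be valid UTF-8, and this is not
necessarily how flash and wall do it.

```c
#include <stddef.h>

/*
 * Return the number of bytes occupied by the first maxchars
 * characters of UTF-8 string s, or by all of s if it is shorter
 * (hypothetical helper).
 */
size_t
utf8_prefix_bytes(const char *s, size_t maxchars)
{
    size_t i = 0, nchars = 0;

    while (s[i]) {
        /* any byte that is not a continuation byte starts a character */
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            if (nchars == maxchars)
                break;          /* next character would be one too many */
            nchars++;
        }
        i++;
    }
    return i;
}
```

For plain ASCII every byte starts a character, so the result is simply
the smaller of maxchars and the string length; the same code therefore
serves both kinds of session.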