Empire has been extended to optionally work with Unicode. This file
documents the design and implementation of this change.


Traditional Empire Character Set
--------------------------------

Empire has always used plain ASCII. It abused the most significant
bit for highlighting. Some commands cleared this bit from some input,
others didn't.

The restriction to the archaic ASCII character set bothered some
players. It is barely serviceable for most Western languages other
than English, and useless for everything else. This is unbecoming for
a game played around the world.


What is Unicode?
----------------

Unicode is the emerging standard character set for all multi-lingual
applications. The core of Unicode is identical to the Universal
Character Set (UCS) defined in ISO 10646.

UCS was originally defined as a 31-bit character set; current versions
of the standards limit it to the range 0-0x10FFFF. The most commonly
used characters are in the range 0-0xFFFF, the so-called Basic
Multilingual Plane (BMP).

A character set can be encoded in different ways. Popular encodings
are UCS-4 (four byte wide characters), UCS-2 (two byte wide
characters; can't represent characters outside the BMP directly), and
UTF-8 (a multibyte encoding).

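To illustrate what "multibyte encoding" means for UTF-8, here is a
minimal sketch in C that encodes one UCS code point. The function
name utf8_encode() is hypothetical and not part of Empire. The sketch
follows the current UTF-8 definition, which stops at 0x10FFFF and
excludes the surrogate range.

```c
#include <stddef.h>

/*
 * Encode code point CP as UTF-8 into BUF.
 * Return the number of bytes written (1 to 4), or 0 if CP cannot be
 * encoded (a surrogate, or beyond 0x10FFFF).
 * Hypothetical helper for illustration only.
 */
size_t
utf8_encode(unsigned long cp, unsigned char buf[4])
{
    if (cp < 0x80) {
        /* plain ASCII: one byte, most significant bit clear */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;           /* surrogates are not characters */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        /* outside the BMP: four bytes */
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Every byte of a sequence longer than one byte has its most
significant bit set, which is what makes the encoding compatible with
ASCII.
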
UTF-8 has a few desirable properties. In particular, it is a
compatible extension of plain (7-bit) ASCII: every ASCII string is
also a valid UTF-8 string, and a plain ASCII byte (an octet with the
most significant bit clear) in a UTF-8 string always encodes the
ASCII character, i.e. it is never part of a multibyte sequence.

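One practical consequence is that code can look for ASCII characters
in a UTF-8 string byte by byte, without decoding it. A minimal sketch
in C; both helper names are hypothetical and not part of Empire:

```c
#include <stddef.h>

/* Return non-zero if every byte of S is plain 7-bit ASCII. */
int
is_plain_ascii(const char *s)
{
    for (; *s; s++) {
        if ((unsigned char)*s & 0x80)
            return 0;
    }
    return 1;
}

/*
 * Count occurrences of the ASCII character C in the UTF-8 string S.
 * A bytewise scan is safe, because all bytes of a multibyte sequence
 * have the most significant bit set and thus can't compare equal to
 * an ASCII character.
 */
size_t
count_ascii_char(const char *s, char c)
{
    size_t n = 0;

    for (; *s; s++) {
        if (*s == c)
            n++;
    }
    return n;
}
```
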
To learn more, see the Unicode FAQ, currently at
http://www.cl.cam.ac.uk/~mgk25/unicode.html


Requirements for Unicode Support in Empire
------------------------------------------

* Full backward compatibility with existing clients

* Easy for clients to support

* Minimal impact on server code; no additional portability headaches

* Interoperability between old and new clients and servers


Principles of Design
--------------------

Client/server communication uses what we call external encoding:
either traditional Empire ASCII (7-bit ASCII plus highlighting bit) or
UTF-8. The choice between the encodings is under the control of the
client, and defaults to Empire ASCII. The chosen encoding is a
property of the session; it doesn't carry over to future sessions.
Highlighting is only supported for output (from server to client).
Highlighting in UTF-8 is done with control characters: ASCII SO (Shift
Out, C-n, decimal 14) starts highlighting, and ASCII SI (Shift In,
C-o, decimal 15) stops highlighting. Text encoded in the client's
external encoding is called user text.

There are two internal encodings. We use UTF-8 for player-player
communication, and Empire ASCII for everything else. Most of the
time, there's no difference, because ASCII is valid UTF-8. The
exception is text where the highlighting bit can be used, i.e. text
in Empire ASCII. We call such text normal text.

Input from the client needs to be translated from the client's
external encoding into internal encoding. We call this input
filtering. Since highlighting is not supported on input, the result
is always valid UTF-8. Commands retrieve input that is player-player
communication directly as UTF-8. Other input is retrieved as ASCII,
which replaces non-ASCII characters by '?'[1].

Input filtering from UTF-8 drops ASCII control characters except
'\t' and '\n'.

Input filtering from ASCII additionally replaces non-ASCII characters
by '?'. The result is plain ASCII, which is also valid UTF-8.

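The two input filters could look roughly like the sketch below. This
is not Empire's actual code: the function names are hypothetical, the
sketch assumes that "ASCII control characters" means bytes below 0x20
plus DEL (0x7F), and it does not check that UTF-8 input is well
formed.

```c
#include <stddef.h>

/* Is CH a control character that input filtering drops? (assumption) */
static int
is_dropped_control(unsigned char ch)
{
    return (ch < 0x20 || ch == 0x7F) && ch != '\t' && ch != '\n';
}

/*
 * Filter input of a UTF-8 session in place: drop control characters.
 * Bytes with the most significant bit set pass through unchanged.
 * Return the length of the result.
 */
size_t
filter_input_utf8(char *s)
{
    char *dst = s;
    const char *src;

    for (src = s; *src; src++) {
        if (!is_dropped_control((unsigned char)*src))
            *dst++ = *src;
    }
    *dst = 0;
    return (size_t)(dst - s);
}

/*
 * Filter input of an ASCII session in place: drop control characters
 * and replace every non-ASCII byte by '?'.
 * Return the length of the result.
 */
size_t
filter_input_ascii(char *s)
{
    char *dst = s;
    const char *src;

    for (src = s; *src; src++) {
        unsigned char ch = (unsigned char)*src;

        if (is_dropped_control(ch))
            continue;
        *dst++ = (ch & 0x80) ? '?' : *src;
    }
    *dst = 0;
    return (size_t)(dst - s);
}
```

Both filters only ever shorten the string or keep its length, so they
can work in place.
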
Output to the client needs to be translated from internal encoding to
the client's external encoding. We call this output filtering. It is
integrated into the printing functions, i.e. the functions for sending
output to the client. Most of them accept normal text. Some accept
UTF-8, and some only plain ASCII; all of these are clearly documented.

Output filtering to ASCII doesn't change normal text. In UTF-8 text,
it replaces non-ASCII characters by '?'.

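For UTF-8 text going to an ASCII session, the replacement could be
sketched as follows. The function name is hypothetical, and the
sketch assumes one '?' per non-ASCII character rather than one per
byte. It does not validate its input: stray continuation bytes are
silently dropped.

```c
#include <stddef.h>

/*
 * Copy UTF-8 string SRC to DST[SIZE], replacing each non-ASCII
 * character by a single '?'.  The result is truncated to fit and
 * always null-terminated if SIZE is non-zero.
 * Return the length of the result.
 */
size_t
utf8_to_ascii(char *dst, const char *src, size_t size)
{
    size_t n = 0;

    if (size == 0)
        return 0;
    for (; *src && n + 1 < size; src++) {
        unsigned char ch = (unsigned char)*src;

        if ((ch & 0xC0) == 0x80)
            continue;           /* continuation byte, already replaced */
        dst[n++] = (ch & 0x80) ? '?' : (char)ch;
    }
    dst[n] = 0;
    return n;
}
```
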
Output filtering to UTF-8 doesn't change UTF-8 text. In normal text,
it strips highlighting bits and inserts SI/SO control characters in
their place.


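The translation of normal text for a UTF-8 session can be sketched
like this. The function name and the buffer size contract are
assumptions made for illustration, not Empire's actual interface.

```c
#include <stddef.h>

#define SO 14                   /* ASCII Shift Out: start highlighting */
#define SI 15                   /* ASCII Shift In: stop highlighting */

/*
 * Translate normal text SRC (7-bit ASCII plus highlighting bit) to
 * UTF-8 with SO/SI highlighting in DST.
 * DST must have space for 2 * strlen(SRC) + 2 bytes.
 * SRC is assumed not to contain byte 0x80 (a highlighted NUL).
 * Return the length of the result.
 */
size_t
normal_to_utf8(char *dst, const char *src)
{
    char *p = dst;
    int highlighted = 0;

    for (; *src; src++) {
        unsigned char ch = (unsigned char)*src;
        int hl = (ch & 0x80) != 0;

        if (hl != highlighted) {
            /* highlighting changes here: emit the control character */
            *p++ = hl ? SO : SI;
            highlighted = hl;
        }
        *p++ = (char)(ch & 0x7F);       /* strip the highlighting bit */
    }
    if (highlighted)
        *p++ = SI;              /* don't leave highlighting on */
    *p = 0;
    return (size_t)(p - dst);
}
```

The sketch emits a control character only where highlighting changes,
so a run of highlighted characters costs two extra bytes in total.

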
Notes for Clients
-----------------

Clients use session option utf-8 during login to switch the session to
UTF-8. Highlighting is done differently in UTF-8 sessions. Consult
doc/clients-howto for details.

An ASCII session should work just like it did with previous server
versions, except for the treatment of control and non-ASCII
characters. We believe the new behavior makes more sense.

How to program your host to let you use Unicode in your client's user
interface is platform-dependent and beyond the scope of this
document.

Wolfpack's empclient supports UTF-8 if it runs in a terminal that
understands UTF-8. See its manual page.


Implementation Notes
--------------------

A session uses UTF-8 rather than Empire ASCII if PF_UTF8 is set in
member flags of struct player. Session option utf-8 manipulates this
flag.

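Testing and changing such a flag is ordinary bit manipulation. The
sketch below uses a cut-down stand-in for struct player and an assumed
value for PF_UTF8; the real definitions are in Empire's headers, and
the two function names are hypothetical.

```c
/* Assumed value for illustration; the real one is in Empire's headers */
#define PF_UTF8 0x0001

/* Cut-down stand-in for Empire's struct player */
struct player {
    int flags;
};

/* Does PL's session use UTF-8 rather than Empire ASCII? */
int
session_uses_utf8(const struct player *pl)
{
    return (pl->flags & PF_UTF8) != 0;
}

/* Switch PL's session to UTF-8 if ON, else to Empire ASCII. */
void
set_session_utf8(struct player *pl, int on)
{
    if (on)
        pl->flags |= PF_UTF8;
    else
        pl->flags &= ~PF_UTF8;
}
```
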
Input and output filtering code is in src/lib/subs/pr.c.

Almost all code is untouched, and almost all strings are still normal
text. Use of the other encodings is commented (well, we tried!).

The following commands and features have been changed to cope with
Unicode:

* telegram, announce, pray and turn

  Get the text of the message as UTF-8. Actually, the commands
  themselves weren't affected, only the common code to get a
  telex-like message, getele().

* read, wire and the automatic display of MOTD and game-down messages

  Display messages as UTF-8. The changes were limited to common
  message display code. Its entry point was renamed from prnf() to
  uprnf(), to make its unusual UTF-8 argument a bit more obvious.

* flash and wall

  Retrieve the text to send as UTF-8, and send it to the recipient(s)
  with output filtering appropriate for their session applied.

  The text to send can be given on the command line itself. In this
  case it has to be fetched from the raw command line buffer. Since
  that buffer is already UTF-8, no change was required.

  The commands can also prompt for the text to send. A new function
  ugetstring(), patterned after the existing getstring(), takes care
  of that.

  Output filtering is handled by pr_flash(). However, flash and wall
  break long lines, and that required some changes for UTF-8.
  Breaking long lines there is probably a bad idea.
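
To see why breaking lines needs care with UTF-8: a break must not
fall inside a multibyte sequence. The sketch below shows one way to
find a safe break position, assuming the limit is counted in bytes.
The function name is hypothetical, and this is not necessarily how
flash and wall do it.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the length of the longest prefix of UTF-8 string S that is
 * at most MAX bytes long and doesn't end inside a multibyte sequence.
 */
size_t
utf8_break_at(const char *s, size_t max)
{
    size_t len = strlen(s);
    size_t n;

    if (len <= max)
        return len;             /* fits, no need to break */
    /*
     * S[N] would start the next line.  Back up while it is a
     * continuation byte (bit pattern 10xxxxxx).
     */
    n = max;
    while (n > 0 && ((unsigned char)s[n] & 0xC0) == 0x80)
        n--;
    return n;
}
```

With the six-byte string "ab", U+00E9, "cd" and a limit of three
bytes, a naive break would split the two-byte sequence for U+00E9;
the sketch breaks after "ab" instead.
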