empserver/doc/unicode

Empire has been extended to optionally work with Unicode.  This file
documents the design and implementation of this change.


Traditional Empire Character Set
--------------------------------

Empire has always used plain ASCII.  It abused the most significant
bit for highlighting.  Some commands cleared this bit from some input,
others didn't.

The restriction to the archaic ASCII character set bothered some
players.  It is barely serviceable for most western languages other
than English, and useless for everything else.  This is unbecoming for
a game played around the world.


What is Unicode?
----------------

Unicode is the emerging standard character set for all multi-lingual
applications.  The core of Unicode is identical to the Universal
Character Set (UCS) defined in ISO 10646.

UCS is 31-bit.  The most commonly used characters are the range
0-0xFFFD, the so called Basic Multilingual Plane (BMP).

A character set can be encoded in different ways.  Popular encodings
are UCS-4 (four byte wide characters), UCS-2 (two byte wide
characters; can't represent characters outside the BMP directly), and
UTF-8 (a multibyte encoding).

UTF-8 has a few desirable properties.  In particular, it is a
compatible extension of plain (7-bit) ASCII: every ASCII string is
also a valid UTF-8 string, and a plain ASCII byte (an octet with the
most significant bit clear) in an UTF-8 string always encodes the
ASCII character, i.e. it is never part of a multibyte sequence.

To learn more, see the Unicode FAQ, currently at
http://www.cl.cam.ac.uk/~mgk25/unicode.html


Requirements for Unicode Support in Empire
------------------------------------------

* Full backward compatibility to existing clients

* Easy to support for clients

* Minimal impact on server code; no additional portability headaches

* Interoperability between old and new clients and servers


Principles of Design
--------------------

Client/server communications uses what we call external encoding:
either traditional Empire ASCII (7-bit ASCII plus highlighting bit) or
UTF-8.  The choice between the encoding is under the control of the
client, and defaults to Empire ASCII.  The chosen encoding is a
property of the session; it doesn't carry over the future sessions.
Highlighting is only supported for output (from server to client).
Highlighting in UTF-8 is done with control characters: ASCII SO (Shift
Out, C-n, decimal 14) starts highlighting, and ASCII SI (Shift In,
C-o, decimal 15) stops highlighting.  Text encoded in the client's
external encoding is called user text.

There are two internal encodings.  We use UTF-8 for player-player
communication, and Empire ASCII for everything else.  Most of the
time, there's no difference, because ASCII is valid UTF-8.  The
exception is where the highlighting bit can be used.  We call such
text normal text.

Input from the client needs to be translated from the client's
external encoding into internal encoding.  We call this input
filtering.  Since highlighting is not supported on input, the result
is always valid UTF-8.  Commands retrieve input that is player-player
communication directly as UTF-8.  Other input is retrieved as ASCII,
which replaces non-ASCII characters by '?'[1].

Input filtering from UTF-8 drops ASCII control characters except
'\t' and '\n'.

Input filtering from ASCII additionally replaces non-ASCII characters
by '?'.  The result is plain ASCII, which is also valid UTF-8.

Output to the client needs to be translated from internal encoding to
the client's external encoding.  We call this output filtering.  It is
integrated into the printing functions, i.e. the functions for sending
output to the client.  Most of them accept normal text.  Some accept
UTF-8, and some only plain ASCII; all of these are clearly documented.

Output filtering to ASCII doesn't change normal text.  In UTF-8 text,
it replaces non-ASCII characters by '?'.

Output filtering to UTF-8 doesn't change UTF-8 text.  In normal text,
it strips highlighting bits and inserts SI/SO control characters in
their place.


Notes for Clients
-----------------

Clients use session option utf-8 during login to switch the session to
UTF-8.  Highlighting is done differently in UTF-8 sessions.  Consult
doc/clients-howto for details.

An ASCII session should work just like previous server versions,
except for the treatment of control and non-ASCII characters.  We
believe the new behavior makes more sense.

How to program your host to let you use Unicode in your client's user
interface is platform dependent, and beyond the scope of this
document.

Wolfpack's empclient supports UTF-8 if it runs in a terminal that
understands UTF-8.  See its manual page.


Implementation Notes
--------------------

A session uses UTF-8 rather than Empire ASCII if PF_UTF8 is set in
member flags of struct player.  Session option utf-8 manipulates this
flag.

Input and output filtering code is in src/lib/subs/pr.c.

Almost all code is untouched, almost all strings are still normal
text.  Use of the other encodings is commented (well, we tried!).

The following commands and features have been changed to cope with
Unicode:

* telegram, announce, pray and turn

  Get the text of the message as UTF-8.  Actually, the commands
  themselves weren't affected, only the common code to get a
  telex-like message, getele().

* read, wire and the automatic display of MOTD and game-down messages

  Display messages as UTF-8.  The changes were limited to common
  message display code.  Its entry point was renamed from prnf() to
  uprnf(), to make its unusual UTF-8 argument a bit more obvious.

* flash and wall

  Retrieve the text to send as UTF-8, and send it to the recipient(s)
  with output filtering appropriate for their session applied.

  The text to send can be given on the command line itself.  In this
  case it has to be fetched from the raw command line buffer.  Since
  that buffer is already UTF-8, no change was required.

  The commands can also prompt for the text to send.  A new function
  ugetstring() patterned after existing getstring() takes care of
  that.

  Output filtering is handled by pr_flash().  However, flash and wall
  break long lines, and that required some changes for UTF-8.
  Breaking long lines there is probably a bad idea.