-Unicode changes:
+Empire has been extended to optionally work with Unicode. This file
+documents the design and implementation of this change.
-1. toggle UTF-8
- Add utf8 as a toggle option and store in the nat_flags field in
- nation structure. In the future, this should be a login option
- rather than a country toggle once the login options are added.
+Traditional Empire Character Set
+--------------------------------
-2. flash and wall
+Empire has always used plain ASCII. It abused the most significant
+bit for highlighting. Some commands cleared this bit from some input,
+others didn't.
- a. Message as command argument
+The restriction to the archaic ASCII character set bothered some
+players. It is barely serviceable for most western languages other
+than English, and useless for everything else. This is unbecoming for
+a game played around the world.
- Interpret raw command line as message text rather than normal
- text.
- b. Multi-line mode
+What is Unicode?
+----------------
- Read message lines as message text rather than normal text.
+Unicode is the emerging standard character set for all multi-lingual
+applications. The core of Unicode is identical to the Universal
+Character Set (UCS) defined in ISO 10646.
- c. Break long lines
+UCS is 31-bit. The most commonly used characters are in the range
+0-0xFFFD, the so-called Basic Multilingual Plane (BMP).
- Count the charactes using utf8 format. This works for both ASCII
- and UTF8 formatted strings.
+A character set can be encoded in different ways. Popular encodings
+are UCS-4 (four byte wide characters), UCS-2 (two byte wide
+characters; can't represent characters outside the BMP directly), and
+UTF-8 (a multibyte encoding).
- d. Print lines
+UTF-8 has a few desirable properties. In particular, it is a
+compatible extension of plain (7-bit) ASCII: every ASCII string is
+also a valid UTF-8 string, and a plain ASCII byte (an octet with the
+most significant bit clear) in a UTF-8 string always encodes the
+ASCII character, i.e. it is never part of a multibyte sequence.
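The property described above can be illustrated with a minimal C sketch. It assumes valid UTF-8 input; the function name utf8_count_chars() is hypothetical and not part of the Empire source. Because ASCII bytes and lead bytes never look like continuation bytes (bit pattern 10xxxxxx), counting characters works the same way for plain ASCII and for UTF-8 strings.

```c
#include <stddef.h>

/*
 * Count the characters in a NUL-terminated UTF-8 string.
 * Hypothetical helper for illustration only; assumes valid UTF-8.
 * Every byte that is not a continuation byte (10xxxxxx) starts a
 * new character, so plain ASCII strings are counted byte by byte.
 */
static size_t utf8_count_chars(const char *s)
{
    size_t n = 0;

    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For example, a five-letter word containing one two-byte character occupies six bytes but counts as five characters.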
- Print as message text rather than normal text.
+To learn more, see the Unicode FAQ, currently at
+<http://www.cl.cam.ac.uk/~mgk25/unicode.html>.
-3. Telexes and telex-like things
- a. read and wire, MOTD and gamedown message
+Requirements for Unicode Support in Empire
+------------------------------------------
- Print as message text rather than normal text.
+* Full backward compatibility with existing clients
- c. tele, anno, pray, turn.
+* Easy to support for clients
- Read as message text rather than normal text.
+* Minimal impact on server code; no additional portability headaches
-4. Input filtering
+* Interoperability between old and new clients and servers
- a. Parsing commands (normal text)
- Ignore control and non-ASCII characters when copying argument
- strings.
+Principles of Design
+--------------------
- b. Reading normal text command arguments
+Client/server communication uses what we call external encoding:
+either traditional Empire ASCII (7-bit ASCII plus highlighting bit) or
+UTF-8. The choice of encoding is under the control of the
+client, and defaults to Empire ASCII. The chosen encoding is a
+property of the session; it doesn't carry over to future sessions.
+Highlighting is only supported for output (from server to client).
+Highlighting in UTF-8 is done with control characters: ASCII SO (Shift
+Out, C-n, decimal 14) starts highlighting, and ASCII SI (Shift In,
+C-o, decimal 15) stops highlighting. Text encoded in the client's
+external encoding is called user text.
- Replace control and non-ASCII characters, except for tab with
- "?'.
+There are two internal encodings. We use UTF-8 for player-player
+communication, and Empire ASCII for everything else. Most of the
+time, there's no difference, because ASCII is valid UTF-8. The
+exception is text where the highlighting bit can be used. We call
+such Empire ASCII text normal text.
- c. Reading message text command arguments
+Input from the client needs to be translated from the client's
+external encoding into internal encoding. We call this input
+filtering. Since highlighting is not supported on input, the result
+is always valid UTF-8. Commands retrieve input that is player-player
+communication directly as UTF-8. Other input is retrieved as ASCII,
+with non-ASCII characters replaced by '?'.
- Support message text arguments, used by 3a. and 2b. Replace
- control and, if NF_UTF8 is off, non-ASCII characters.
+Input filtering from UTF-8 drops ASCII control characters except
+'\t' and '\n'.
-5. Output filtering
+Input filtering from ASCII additionally replaces non-ASCII characters
+by '?'. The result is plain ASCII, which is also valid UTF-8.
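The two input filtering rules above could be sketched in C roughly as follows. This is not the code in the server; filter_input() and is_dropped_control() are hypothetical names. The sketch assumes that "ASCII control characters" means 0x00-0x1F plus DEL (0x7F), replaces each non-ASCII byte (rather than each character) by '?' in ASCII sessions, and does not validate UTF-8.

```c
#include <stddef.h>

/*
 * Hypothetical helper: nonzero if C is an ASCII control character
 * that input filtering drops.  Tab and newline are kept.
 * Assumption: DEL (0x7F) counts as a control character.
 */
static int is_dropped_control(unsigned char c)
{
    if (c == '\t' || c == '\n')
        return 0;
    return c < 0x20 || c == 0x7F;
}

/*
 * Filter the NUL-terminated input in BUF in place.
 * UTF8 is nonzero for a UTF-8 session, zero for an ASCII session.
 * Returns the length of the filtered string.
 */
static size_t filter_input(char *buf, int utf8)
{
    char *dst = buf;
    const char *src;

    for (src = buf; *src; src++) {
        unsigned char c = (unsigned char)*src;

        if (is_dropped_control(c))
            continue;           /* both encodings drop these */
        if (!utf8 && c >= 0x80)
            c = '?';            /* ASCII session: no non-ASCII bytes */
        *dst++ = (char)c;
    }
    *dst = '\0';
    return (size_t)(dst - buf);
}
```

Filtering in place works because the output is never longer than the input.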
- Output filtering asssumes that there are no control characters or
- invalid characters in the output messages. The control characters
- and invalid characters are filtered out during input filtering or
- that the server will not generate control characters or invalid
- characters.
+Output to the client needs to be translated from internal encoding to
+the client's external encoding. We call this output filtering. It is
+integrated into the printing functions, i.e. the functions for sending
+output to the client. Most of them accept normal text. Some accept
+UTF-8, and some only plain ASCII; all of these are clearly documented.
- a. Printing normal text
+Output filtering to ASCII doesn't change normal text. In UTF-8 text,
+it replaces non-ASCII characters by '?'.
- When NF_UTF8 is on, highlighted text is printed using SO/SI.
+Output filtering to UTF-8 doesn't change UTF-8 text. In normal text,
+it strips highlighting bits and inserts SO/SI control characters in
+their place.
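As an illustration of the two output filtering directions, here is a minimal C sketch. It is not the code in the server; normal_to_utf8() and utf8_to_ascii() are hypothetical names. The sketch assumes valid UTF-8 input and emits one '?' per non-ASCII character.

```c
#include <stddef.h>
#include <string.h>

enum { ASCII_SO = 14, ASCII_SI = 15 };

/*
 * Translate normal text SRC (7-bit ASCII plus highlighting bit) for
 * a UTF-8 session: strip the highlighting bit and bracket each
 * highlighted run with SO ... SI.
 * DST must have room for 3 * strlen(SRC) + 1 bytes; otherwise the
 * function returns -1.  On success it returns the output length.
 */
static long normal_to_utf8(char *dst, size_t dstsize, const char *src)
{
    size_t n = 0;
    int highlighted = 0;

    if (dstsize < 3 * strlen(src) + 1)
        return -1;
    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;
        int hl = (c & 0x80) != 0;

        if (hl != highlighted) {
            dst[n++] = hl ? ASCII_SO : ASCII_SI;
            highlighted = hl;
        }
        dst[n++] = (char)(c & 0x7F);
    }
    if (highlighted)
        dst[n++] = ASCII_SI;    /* close a run that reaches the end */
    dst[n] = '\0';
    return (long)n;
}

/*
 * Translate UTF-8 text in BUF for an ASCII session, in place:
 * each non-ASCII character becomes a single '?'.
 */
static void utf8_to_ascii(char *buf)
{
    char *dst = buf;
    const char *src;

    for (src = buf; *src; src++) {
        unsigned char c = (unsigned char)*src;

        if (c < 0x80)
            *dst++ = (char)c;
        else if ((c & 0xC0) != 0x80)
            *dst++ = '?';       /* lead byte of a multibyte sequence */
        /* continuation bytes are dropped */
    }
    *dst = '\0';
}
```

The remaining two cases need no code: normal text passes unchanged to an ASCII session, and UTF-8 text passes unchanged to a UTF-8 session.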
- b. Printing message text
- When NF_UTF8 is off, replace UTF8 charactes with '?'.
+Notes for Clients
+-----------------
+Clients use session option utf-8 during login to switch the session to
+UTF-8. Highlighting is done differently in UTF-8 sessions. Consult
+doc/clients-howto for details.
-Definitions:
+An ASCII session should work just like previous server versions,
+except for the treatment of control and non-ASCII characters. We
+believe the new behavior makes more sense.
-1. Normal Text
- For normal text, the following ASCII characters are valid:
- CR, LF and 0x20-0x7e. Normally, LF is an termination action
- event. Normally, CR is not used except by the server.
- Normal Text does not support UTF8 characters. In normal
- text, the 8th bit is used a highlight bit. If the client
- has the utf8 nation flag set, the standout bit is removed
- and the highlight block is prefixed with SO (ASCII standout)
- and suffixed with SI (ASCII standin).
-
-2. Message Text
- For message text, the following ASCII characters are valid:
- Tab, CR, LF and 0x020-0x7e. Normally, LF is an termination
- action event. Normally, CR is not used except by the server.
- Message text also supports UTF8 characters if the utf8 nation
- flag is turn on otherwise only the ASCII characters are
- supported.
+How to program your host to let you use Unicode in your client's user
+interface is platform dependent, and beyond the scope of this
+document.
+Wolfpack's empclient supports UTF-8 if it runs in a terminal that
+understands UTF-8. See its manual page.
-Notes:
-1. Strings that considered message text are commented.
+Implementation Notes
+--------------------
-2. Both Normal and Message text are char strings are in the server.
- Care needs to be taken as some compiler consider char
- signed and other default to unsigned char.
+A session uses UTF-8 rather than Empire ASCII if PF_UTF8 is set in
+member flags of struct player. Session option utf-8 manipulates this
+flag.
-3. Unicode functions are prefixed with u.
+Input and output filtering code is in src/lib/subs/pr.c.
+
+Almost all code is untouched, and almost all strings are still normal
+text. Use of the other encodings is commented (well, we tried!).
+
+The following commands and features have been changed to cope with
+Unicode:
+
+* telegram, announce, pray and turn
+
+ Get the text of the message as UTF-8. Actually, the commands
+ themselves weren't affected, only the common code to get a
+ telex-like message, getele().
+
+* read, wire and the automatic display of MOTD and game-down messages
+
+ Display messages as UTF-8. The changes were limited to common
+ message display code. Its entry point was renamed from prnf() to
+ uprnf(), to make its unusual UTF-8 argument a bit more obvious.
+
+* flash and wall
+
+ Retrieve the text to send as UTF-8, and send it to the recipient(s)
+ with output filtering appropriate for their session applied.
+
+ The text to send can be given on the command line itself. In this
+ case it has to be fetched from the raw command line buffer. Since
+ that buffer is already UTF-8, no change was required.
+
+ The commands can also prompt for the text to send. A new function
+ ugetstring(), patterned after the existing getstring(), takes care
+ of that.
+
+ Output filtering is handled by pr_flash().