Rewritten in an attempt to present all the initial revision's valuable
information in a more accessible form, and then some.
2005-06-23 19:08:19 +00:00 · 2005-06-23 19:08:19 +00:00 · 5a25fd93c5
commit 5a25fd93c5
parent 39b3493851
1 changed file with 145 additions and 116 deletions
--- a/doc/unicode
+++ b/doc/unicode
@@ -1,134 +1,163 @@
-Unicode changes:
+Empire has been extended to optionally work with Unicode.  This file
-
+documents the design and implementation of this change.
 1. login utf-8
   Added login options.  The first option is utf-8; it sets the
   player's PF_UTF8 flag.  Default is off.
   Syntax
   options utf-8       -- turns on the utf-8
   options utf-8=1     -- turns on the utf-8
   options utf-8=0     -- turns off the utf-8
   options             -- lists current options and their values
 2. flash and wall
   a. Message as command argument
      Interpret raw command line as message text rather than normal
      text.
   b. Multi-line mode
      Read message lines as message text rather than normal text.
   c. Break long lines
      Count the characters using utf8 format.  This works for both ASCII
      and UTF8 formatted strings.
   d. Print lines
      Print as message text rather than normal text.
 3. Telexes and telex-like things
   a. read and wire, MOTD and gamedown message
      Print as message text rather than normal text.
   c. tele, anno, pray, turn.
      Read as message text rather than normal text.
 4. Input filtering
   a. Parsing commands (normal text)
      Ignore control and non-ASCII characters when copying argument
      strings.
   b. Reading normal text command arguments
      Replace control and non-ASCII characters, except for tab, with
      '?'.
   c. Reading message text command arguments
      Support message text arguments, used by 3a. and 2b.  Replace
      control and, if NF_UTF8 is off, non-ASCII characters.
 5. Output filtering
   Output filtering assumes that there are no control characters or
   invalid characters in the output messages: either they were
   filtered out during input filtering, or the server never generated
   them in the first place.
   a. Printing normal text
      When NF_UTF8 is on, highlighted text is printed using SO/SI.
   b. Printing message text
      When NF_UTF8 is off, replace UTF8 characters with '?'.
-Definitions:
+Traditional Empire Character Set
 --------------------------------
-1. Normal Text
+Empire has always used plain ASCII.  It abused the most significant
-	For normal text, the following ASCII characters are valid:
+bit for highlighting.  Some commands cleared this bit from some input,
-	CR, LF and 0x20-0x7e.  Normally, LF is an termination action
+others didn't.
-	event.	Normally, CR is not used except by the server.
+
-	Normal Text does not support UTF8 characters.  In normal
+The restriction to the archaic ASCII character set bothered some
-	text, the 8th bit is used a highlight bit.  If the client
+players.  It is barely serviceable for most western languages other
-	has the utf8 nation flag set, the standout bit is removed
+than English, and useless for everything else.  This is unbecoming for
-	and the highlight block is prefixed with SO (ASCII standout)
+a game played around the world.
 	and suffixed with SI (ASCII standin).
 2. Message Text
 	For message text, the following ASCII characters are valid:
 	Tab, CR, LF and 0x20-0x7e.  Normally, LF is a termination
 	action event.	Normally, CR is not used except by the server.
 	Message text also supports UTF8 characters if the utf8 nation
 	flag is turned on; otherwise only the ASCII characters are
 	supported.
-Notes:
+What is Unicode?
 ----------------
-1. Strings that are considered message text are commented.
+Unicode is the emerging standard character set for all multi-lingual
 applications.  The core of Unicode is identical to the Universal
 Character Set (UCS) defined in ISO 10646.
-2. Both Normal and Message text are char strings in the server.
+UCS is 31-bit.  The most commonly used characters are in the range
-	Care needs to be taken as some compilers consider char
+0-0xFFFD, the so-called Basic Multilingual Plane (BMP).
 	signed and others default to unsigned char.
-3. Unicode functions are prefixed with u.
+A character set can be encoded in different ways.  Popular encodings
 are UCS-4 (four byte wide characters), UCS-2 (two byte wide
 characters; can't represent characters outside the BMP directly), and
 UTF-8 (a multibyte encoding).
-Notes for Client Implementors:
+UTF-8 has a few desirable properties.  In particular, it is a
 compatible extension of plain (7-bit) ASCII: every ASCII string is
 also a valid UTF-8 string, and a plain ASCII byte (an octet with the
 most significant bit clear) in a UTF-8 string always encodes the
 ASCII character, i.e. it is never part of a multibyte sequence.
-ASCII Mode
+To learn more, see the Unicode FAQ, currently at
 http://www.cl.cam.ac.uk/~mgk25/unicode.html
 1. If you do not specify a login option, the server will start the
 session in ASCII mode.
-2. This is close to the previous mode (<4.2.21) but there is more filtering
+Requirements for Unicode Support in Empire
-to remove non-ASCII characters and ASCII control characters.
+------------------------------------------
-3. If another client in UTF8 mode tries to send to this client then the
+* Full backward compatibility to existing clients
 server will replace the non-ASCII characters with question marks.
-4. The standout works the same as before where the 8th bit indicates that
+* Easy to support for clients
 the character should be highlighted.
-UTF8 Mode
+* Minimal impact on server code; no additional portability headaches
-1. The login options must be specified before the play command is sent.
+* Interoperability between old and new clients and servers
 The syntax is 'options utf-8'.
 2. The server will filter ASCII control characters but will pass any characters
 with the 8th bit set.
-3. For the standout mode, the server inserts an ASCII SO character at the
+Principles of Design
-beginning of standout sequence and the server sends an ASCII SI character at
+--------------------
-the end of the standout sequence.
+
 Client/server communication uses what we call external encoding:
 either traditional Empire ASCII (7-bit ASCII plus highlighting bit) or
 UTF-8.  The choice between the encodings is under the control of the
 client, and defaults to Empire ASCII.  The chosen encoding is a
 property of the session; it doesn't carry over to future sessions.
 Highlighting is only supported for output (from server to client).
 Highlighting in UTF-8 is done with control characters: ASCII SO (Shift
 Out, C-n, decimal 14) starts highlighting, and ASCII SI (Shift In,
 C-o, decimal 15) stops highlighting.  Text encoded in the client's
 external encoding is called user text.
 There are two internal encodings.  We use UTF-8 for player-player
 communication, and Empire ASCII for everything else.  Most of the
 time, there's no difference, because ASCII is valid UTF-8.  The
 exception is where the highlighting bit can be used.  We call such
 text normal text.
 Input from the client needs to be translated from the client's
 external encoding into internal encoding.  We call this input
 filtering.  Since highlighting is not supported on input, the result
 is always valid UTF-8.  Commands retrieve input that is player-player
 communication directly as UTF-8.  Other input is retrieved as ASCII,
 which replaces non-ASCII characters by '?'[1].
 Input filtering from UTF-8 drops ASCII control characters except
 '\t' and '\n'.
 Input filtering from ASCII additionally replaces non-ASCII characters
 by '?'.  The result is plain ASCII, which is also valid UTF-8.
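The two input filters described above can be sketched roughly as
follows.  This is only an illustration under stated assumptions, not
the server's actual code (which lives in src/lib/subs/pr.c): the
names filter_input_utf8() and filter_input_ascii() are hypothetical,
DEL (0x7f) is assumed to count as an ASCII control character, and no
attempt is made to validate UTF-8 sequences.

```c
#include <assert.h>
#include <stddef.h>

/* True for ASCII control characters other than tab and newline.
 * Treating DEL (0x7f) as a control character is an assumption. */
static int
is_dropped_control(unsigned char ch)
{
    return (ch < 0x20 && ch != '\t' && ch != '\n') || ch == 0x7f;
}

/*
 * Hypothetical input filter for a UTF-8 session.
 * Copy SRC to DST, dropping ASCII control characters except '\t'
 * and '\n'.  Bytes with the most significant bit set pass unchanged.
 * DST must have room for strlen(SRC) + 1 bytes.
 * Return the length of the result.
 */
size_t
filter_input_utf8(char *dst, const char *src)
{
    size_t n = 0;

    assert(dst && src);
    for (; *src; src++) {
        unsigned char ch = (unsigned char)*src;

        if (!is_dropped_control(ch))
            dst[n++] = (char)ch;
    }
    dst[n] = '\0';
    return n;
}

/*
 * Hypothetical input filter for an ASCII session.
 * Like filter_input_utf8(), but additionally replace every byte with
 * the most significant bit set by '?'.  The result is plain ASCII.
 */
size_t
filter_input_ascii(char *dst, const char *src)
{
    size_t n = 0;

    assert(dst && src);
    for (; *src; src++) {
        unsigned char ch = (unsigned char)*src;

        if (is_dropped_control(ch))
            continue;
        dst[n++] = (ch & 0x80) ? '?' : (char)ch;
    }
    dst[n] = '\0';
    return n;
}
```

Both functions only ever shrink or preserve the length of their
input, so a destination buffer as large as the source is sufficient.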
 Output to the client needs to be translated from internal encoding to
 the client's external encoding.  We call this output filtering.  It is
 integrated into the printing functions, i.e. the functions for sending
 output to the client.  Most of them accept normal text.  Some accept
 UTF-8, and some only plain ASCII; all of these are clearly documented.
 Output filtering to ASCII doesn't change normal text.  In UTF-8 text,
 it replaces non-ASCII characters by '?'.
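As a rough illustration of output filtering UTF-8 text for an ASCII
session, the sketch below emits one '?' per non-ASCII character
rather than one per byte.  The function name utf8_to_ascii_output()
is hypothetical, and the input is assumed to be valid UTF-8; the real
code in src/lib/subs/pr.c may be organized differently.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical output filter: UTF-8 text to an ASCII session.
 * Copy SRC to DST, replacing each non-ASCII character by a single
 * '?'.  DST must have room for strlen(SRC) + 1 bytes.
 * Return the length of the result.
 */
size_t
utf8_to_ascii_output(char *dst, const char *src)
{
    size_t n = 0;

    assert(dst && src);
    for (; *src; src++) {
        unsigned char ch = (unsigned char)*src;

        if (ch < 0x80)
            dst[n++] = (char)ch;    /* plain ASCII: copy */
        else if ((ch & 0xc0) != 0x80)
            dst[n++] = '?';         /* first byte of a multibyte sequence */
        /* else: continuation byte (10xxxxxx), already accounted for */
    }
    dst[n] = '\0';
    return n;
}
```

Because an ASCII byte is never part of a multibyte sequence, the
filter can decide what to do with each byte by looking at that byte
alone.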
 Output filtering to UTF-8 doesn't change UTF-8 text.  In normal text,
 it strips highlighting bits and inserts SO/SI control characters in
 their place.
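The translation of normal text for a UTF-8 session might look like
the sketch below.  The function name normal_to_utf8_output() and the
macro names are hypothetical.  The sketch assumes that normal text
never contains a highlighted NUL (byte 0x80), and that the caller
provides a destination buffer of at least 3 * strlen(src) + 1 bytes,
which covers the worst case of alternating highlighted and plain
characters.

```c
#include <assert.h>
#include <stddef.h>

#define ASCII_SO 0x0e   /* Shift Out: start highlighting */
#define ASCII_SI 0x0f   /* Shift In: stop highlighting */

/*
 * Hypothetical output filter: normal text to a UTF-8 session.
 * Copy SRC to DST, stripping the highlighting bit and bracketing
 * each run of highlighted characters with SO ... SI.
 * Return the length of the result.
 */
size_t
normal_to_utf8_output(char *dst, const char *src)
{
    size_t n = 0;
    int highlighted = 0;    /* are we inside an SO ... SI run? */

    assert(dst && src);
    for (; *src; src++) {
        unsigned char ch = (unsigned char)*src;
        int hl = (ch & 0x80) != 0;

        if (hl != highlighted) {
            dst[n++] = hl ? ASCII_SO : ASCII_SI;
            highlighted = hl;
        }
        dst[n++] = (char)(ch & 0x7f);
    }
    if (highlighted)
        dst[n++] = ASCII_SI;    /* close a run that reaches the end */
    dst[n] = '\0';
    return n;
}
```

Emitting one SO/SI pair per run, rather than per character, keeps the
output short when whole words are highlighted.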
 Notes for Clients
 -----------------
 Clients use session option utf-8 during login to switch the session to
 UTF-8.  Highlighting is done differently in UTF-8 sessions.  Consult
 doc/clients-howto for details.
 An ASCII session should work just like previous server versions,
 except for the treatment of control and non-ASCII characters.  We
 believe the new behavior makes more sense.
 How to program your host to let you use Unicode in your client's user
 interface is platform dependent, and beyond the scope of this
 document.
 Wolfpack's empclient supports UTF-8 if it runs in a terminal that
 understands UTF-8.  See its manual page.
 Implementation Notes
 --------------------
 A session uses UTF-8 rather than Empire ASCII if PF_UTF8 is set in
 member flags of struct player.  Session option utf-8 manipulates this
 flag.
 Input and output filtering code is in src/lib/subs/pr.c.
 Almost all code is untouched, and almost all strings are still normal
 text.  Use of the other encodings is commented (well, we tried!).
 The following commands and features have been changed to cope with
 Unicode:
 * telegram, announce, pray and turn
  Get the text of the message as UTF-8.  Actually, the commands
  themselves weren't affected, only the common code to get a
  telex-like message, getele().
 * read, wire and the automatic display of MOTD and game-down messages
  Display messages as UTF-8.  The changes were limited to common
  message display code.  Its entry point was renamed from prnf() to
  uprnf(), to make its unusual UTF-8 argument a bit more obvious.
 * flash and wall
  Retrieve the text to send as UTF-8, and send it to the recipient(s)
  with output filtering appropriate for their session applied.
  The text to send can be given on the command line itself.  In this
  case it has to be fetched from the raw command line buffer.  Since
  that buffer is already UTF-8, no change was required.
  The commands can also prompt for the text to send.  A new function
  ugetstring() patterned after existing getstring() takes care of
  that.
  Output filtering is handled by pr_flash().  However, flash and wall
  break long lines, and that required some changes for UTF-8.
  Breaking long lines there is probably a bad idea.
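To illustrate why line breaking needed changes for UTF-8: a line
limit expressed in characters no longer equals a limit in bytes, and
a break must not fall inside a multibyte sequence.  The sketch below
is one possible way to find a safe break point; it is not taken from
the flash and wall code.  The function name utf8_prefix_bytes() is
hypothetical, and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the number of bytes in the longest
 * prefix of S that holds at most MAX_CHARS characters.  A character
 * starts at every byte that is not a continuation byte (10xxxxxx),
 * so the returned offset never splits a multibyte sequence.
 */
size_t
utf8_prefix_bytes(const char *s, size_t max_chars)
{
    size_t i = 0;
    size_t chars = 0;

    assert(s);
    while (s[i]) {
        if (((unsigned char)s[i] & 0xc0) != 0x80) {
            /* first byte of a new character */
            if (chars == max_chars)
                break;
            chars++;
        }
        i++;
    }
    return i;
}
```

For plain ASCII strings the result is simply the smaller of the
string length and MAX_CHARS, which is why the same counting works for
both ASCII and UTF-8 text.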