From 5a25fd93c52db7c199feea9b56cd344456a342db Mon Sep 17 00:00:00 2001
From: Markus Armbruster
Date: Thu, 23 Jun 2005 19:08:19 +0000
Subject: [PATCH] Rewritten in an attempt to present all the initial revision's valuable information in a more accessible form, and then some.

---
 doc/unicode | 203 ++++++++++++++++++++++++++++++----------------------
 1 file changed, 116 insertions(+), 87 deletions(-)

diff --git a/doc/unicode b/doc/unicode
index 49b1f9ab8..a5b5c11da 100644
--- a/doc/unicode
+++ b/doc/unicode
@@ -1,134 +1,163 @@
-Unicode changes:
+Empire has been extended to optionally work with Unicode. This file
+documents the design and implementation of this change.

-1. login utf-8
-   Added a login options. The first option is utf-8 and it sets
-   the PF_UTF8 player's flags. Default is off.

-   Syntax
-      options utf-8 -- turns on the utf-8
-      options utf-8=1 -- turns on the utf-8
-      options utf-8=0 -- turns off the utf-8
-      options -- lists current options and their values
+Traditional Empire Character Set
+--------------------------------

-2. flash and wall
+Empire has always used plain ASCII. It abused the most significant
+bit for highlighting. Some commands cleared this bit from some input,
+others didn't.

-  a. Message as command argument
+The restriction to the archaic ASCII character set bothered some
+players. It is barely serviceable for most western languages other
+than English, and useless for everything else. This is unbecoming for
+a game played around the world.

-    Interpret raw command line as message text rather than normal
-    text.

-  b. Multi-line mode
+What is Unicode?
+----------------

-    Read message lines as message text rather than normal text.
+Unicode is the emerging standard character set for all multi-lingual
+applications. The core of Unicode is identical to the Universal
+Character Set (UCS) defined in ISO 10646.

-  c. Break long lines
+UCS is 31-bit. The most commonly used characters are the range
+0-0xFFFD, the so called Basic Multilingual Plane (BMP).

-    Count the charactes using utf8 format. This works for both ASCII
-    and UTF8 formatted strings.
+A character set can be encoded in different ways. Popular encodings
+are UCS-4 (four byte wide characters), UCS-2 (two byte wide
+characters; can't represent characters outside the BMP directly), and
+UTF-8 (a multibyte encoding).

-  d. Print lines
+UTF-8 has a few desirable properties. In particular, it is a
+compatible extension of plain (7-bit) ASCII: every ASCII string is
+also a valid UTF-8 string, and a plain ASCII byte (an octet with the
+most significant bit clear) in a UTF-8 string always encodes the
+ASCII character, i.e. it is never part of a multibyte sequence.

-    Print as message text rather than normal text.
+To learn more, see the Unicode FAQ, currently at
+http://www.cl.cam.ac.uk/~mgk25/unicode.html

-3. Telexes and telex-like things

-  a. read and wire, MOTD and gamedown message
+Requirements for Unicode Support in Empire
+------------------------------------------

-    Print as message text rather than normal text.
+* Full backward compatibility to existing clients

-  c. tele, anno, pray, turn.
+* Easy to support for clients

-    Read as message text rather than normal text.
+* Minimal impact on server code; no additional portability headaches

-4. Input filtering
+* Interoperability between old and new clients and servers

-  a. Parsing commands (normal text)

-    Ignore control and non-ASCII characters when copying argument
-    strings.
+Principles of Design
+--------------------

-  b. Reading normal text command arguments
+Client/server communication uses what we call external encoding:
+either traditional Empire ASCII (7-bit ASCII plus highlighting bit) or
+UTF-8. The choice between the encodings is under the control of the
+client, and defaults to Empire ASCII. The chosen encoding is a
+property of the session; it doesn't carry over to future sessions.
+Highlighting is only supported for output (from server to client).
+Highlighting in UTF-8 is done with control characters: ASCII SO (Shift
+Out, C-n, decimal 14) starts highlighting, and ASCII SI (Shift In,
+C-o, decimal 15) stops highlighting. Text encoded in the client's
+external encoding is called user text.

-    Replace control and non-ASCII characters, except for tab with
-    "?'.
+There are two internal encodings. We use UTF-8 for player-player
+communication, and Empire ASCII for everything else. Most of the
+time, there's no difference, because ASCII is valid UTF-8. The
+exception is where the highlighting bit can be used. We call such
+text normal text.

-  c. Reading message text command arguments
+Input from the client needs to be translated from the client's
+external encoding into internal encoding. We call this input
+filtering. Since highlighting is not supported on input, the result
+is always valid UTF-8. Commands retrieve input that is player-player
+communication directly as UTF-8. Other input is retrieved as ASCII,
+which replaces non-ASCII characters by '?'[1].

-    Support message text arguments, used by 3a. and 2b. Replace
-    control and, if NF_UTF8 is off, non-ASCII characters.
+Input filtering from UTF-8 drops ASCII control characters except
+'\t' and '\n'.

-5. Output filtering
+Input filtering from ASCII additionally replaces non-ASCII characters
+by '?'. The result is plain ASCII, which is also valid UTF-8.

-  Output filtering asssumes that there are no control characters or
-  invalid characters in the output messages. The control characters
-  and invalid characters are filtered out during input filtering or
-  that the server will not generate control characters or invalid
-  characters.
+Output to the client needs to be translated from internal encoding to
+the client's external encoding. We call this output filtering. It is
+integrated into the printing functions, i.e. the functions for sending
+output to the client. Most of them accept normal text. Some accept
+UTF-8, and some only plain ASCII; all of these are clearly documented.

-  a. Printing normal text
+Output filtering to ASCII doesn't change normal text. In UTF-8 text,
+it replaces non-ASCII characters by '?'.

-    When NF_UTF8 is on, highlighted text is printed using SO/SI.
+Output filtering to UTF-8 doesn't change UTF-8 text. In normal text,
+it strips highlighting bits and inserts SI/SO control characters in
+their place.

-  b. Printing message text

-    When NF_UTF8 is off, replace UTF8 charactes with '?'.
+Notes for Clients
+-----------------

+Clients use session option utf-8 during login to switch the session to
+UTF-8. Highlighting is done differently in UTF-8 sessions. Consult
+doc/clients-howto for details.

-Definitions:
+An ASCII session should work just like previous server versions,
+except for the treatment of control and non-ASCII characters. We
+believe the new behavior makes more sense.

-1. Normal Text
-   For normal text, the following ASCII characters are valid:
-   CR, LF and 0x20-0x7e. Normally, LF is an termination action
-   event. Normally, CR is not used except by the server.
-   Normal Text does not support UTF8 characters. In normal
-   text, the 8th bit is used a highlight bit. If the client
-   has the utf8 nation flag set, the standout bit is removed
-   and the highlight block is prefixed with SO (ASCII standout)
-   and suffixed with SI (ASCII standin).
-
-2. Message Text
-   For message text, the following ASCII characters are valid:
-   Tab, CR, LF and 0x020-0x7e. Normally, LF is an termination
-   action event. Normally, CR is not used except by the server.
-   Message text also supports UTF8 characters if the utf8 nation
-   flag is turn on otherwise only the ASCII characters are
-   supported.
+How to program your host to let you use Unicode in your client's user
+interface is platform dependent, and beyond the scope of this
+document.

+Wolfpack's empclient supports UTF-8 if it runs in a terminal that
+understands UTF-8. See its manual page.

-Notes:

-1. Strings that considered message text are commented.
+Implementation Notes
+--------------------

-2. Both Normal and Message text are char strings are in the server.
-   Care needs to be taken as some compiler consider char
-   signed and other default to unsigned char.
+A session uses UTF-8 rather than Empire ASCII if PF_UTF8 is set in
+member flags of struct player. Session option utf-8 manipulates this
+flag.

-3. Unicode functions are prefixed with u.
+Input and output filtering code is in src/lib/subs/pr.c.

-Notes for Client Implementors:
+Almost all code is untouched, almost all strings are still normal
+text. Use of the other encodings is commented (well, we tried!).

-ASCII Mode
+The following commands and features have been changed to cope with
+Unicode:

-1. If you do not specify a login options, it the server will start the
-session in ASCII mode.
+* telegram, announce, pray and turn

-2. This is close to the previous mode (<4.2.21) but there is more filtering
-to remove non-ASCII characters and ASCII control characters.
+  Get the text of the message as UTF-8. Actually, the commands
+  themselves weren't affected, only the common code to get a
+  telex-like message, getele().

-3. If another client in UTF8 mode tries to send to this client then the
-server will replace the non-ASCII characters with question marks.
+* read, wire and the automatic display of MOTD and game-down messages

-4. The standout works the same as before where the 8th bit indicates that
-the character should be highlighted.
+  Display messages as UTF-8. The changes were limited to common
+  message display code. Its entry point was renamed from prnf() to
+  uprnf(), to make its unusual UTF-8 argument a bit more obvious.

-UTF8 Mode
+* flash and wall

-1. The login options must be specified before the play command is sent.
-The syntax is 'options utf-8'.
+  Retrieve the text to send as UTF-8, and send it to the recipient(s)
+  with output filtering appropriate for their session applied.

-2. The server will filter ASCII control characters but will pass any characters
-with the 8 bit set.
+  The text to send can be given on the command line itself. In this
+  case it has to be fetched from the raw command line buffer. Since
+  that buffer is already UTF-8, no change was required.

-3. For the standout mode, the server inserts an ASCII SO character at the
-beginning of standout sequence and the server sends an ASCII SI character at
-the end of the standout sequence.
+  The commands can also prompt for the text to send. A new function
+  ugetstring() patterned after existing getstring() takes care of
+  that.
+
+  Output filtering is handled by pr_flash(). However, flash and wall
+  break long lines, and that required some changes for UTF-8.
+  Breaking long lines there is probably a bad idea.
-- 
2.43.0