Rewritten in an attempt to present all the initial revision's valuable
information in a more accessible form, and then some.
parent
39b3493851
commit
5a25fd93c5
1 changed files with 145 additions and 116 deletions
259
doc/unicode
@ -1,134 +1,163 @@
|
||||||
Unicode changes:
|
Empire has been extended to optionally work with Unicode. This file
|
||||||
|
documents the design and implementation of this change.
|
||||||
1. login utf-8
|
|
||||||
|
|
||||||
Added login options. The first option is utf-8 and it sets
|
|
||||||
the PF_UTF8 player's flags. Default is off.
|
|
||||||
Syntax
|
|
||||||
options utf-8 -- turns on the utf-8
|
|
||||||
options utf-8=1 -- turns on the utf-8
|
|
||||||
options utf-8=0 -- turns off the utf-8
|
|
||||||
options -- lists current options and their values
|
|
||||||
|
|
||||||
2. flash and wall
|
|
||||||
|
|
||||||
a. Message as command argument
|
|
||||||
|
|
||||||
Interpret raw command line as message text rather than normal
|
|
||||||
text.
|
|
||||||
|
|
||||||
b. Multi-line mode
|
|
||||||
|
|
||||||
Read message lines as message text rather than normal text.
|
|
||||||
|
|
||||||
c. Break long lines
|
|
||||||
|
|
||||||
Count the characters using utf8 format. This works for both ASCII
|
|
||||||
and UTF8 formatted strings.
|
|
||||||
|
|
||||||
d. Print lines
|
|
||||||
|
|
||||||
Print as message text rather than normal text.
|
|
||||||
|
|
||||||
3. Telexes and telex-like things
|
|
||||||
|
|
||||||
a. read and wire, MOTD and gamedown message
|
|
||||||
|
|
||||||
Print as message text rather than normal text.
|
|
||||||
|
|
||||||
c. tele, anno, pray, turn.
|
|
||||||
|
|
||||||
Read as message text rather than normal text.
|
|
||||||
|
|
||||||
4. Input filtering
|
|
||||||
|
|
||||||
a. Parsing commands (normal text)
|
|
||||||
|
|
||||||
Ignore control and non-ASCII characters when copying argument
|
|
||||||
strings.
|
|
||||||
|
|
||||||
b. Reading normal text command arguments
|
|
||||||
|
|
||||||
Replace control and non-ASCII characters, except for tab, with
|
|
||||||
'?'.
|
|
||||||
|
|
||||||
c. Reading message text command arguments
|
|
||||||
|
|
||||||
Support message text arguments, used by 3a. and 2b. Replace
|
|
||||||
control and, if NF_UTF8 is off, non-ASCII characters.
|
|
||||||
|
|
||||||
5. Output filtering
|
|
||||||
|
|
||||||
Output filtering assumes that there are no control characters or
|
|
||||||
invalid characters in the output messages. The control characters
|
|
||||||
and invalid characters are filtered out during input filtering or
|
|
||||||
that the server will not generate control characters or invalid
|
|
||||||
characters.
|
|
||||||
|
|
||||||
a. Printing normal text
|
|
||||||
|
|
||||||
When NF_UTF8 is on, highlighted text is printed using SO/SI.
|
|
||||||
|
|
||||||
b. Printing message text
|
|
||||||
|
|
||||||
When NF_UTF8 is off, replace UTF8 characters with '?'.
|
|
||||||
|
|
||||||
|
|
||||||
Definitions:
|
Traditional Empire Character Set
|
||||||
|
--------------------------------
|
||||||
|
|
||||||
1. Normal Text
|
Empire has always used plain ASCII. It abused the most significant
|
||||||
For normal text, the following ASCII characters are valid:
|
bit for highlighting. Some commands cleared this bit from some input,
|
||||||
CR, LF and 0x20-0x7e. Normally, LF is a termination action
|
others didn't.
|
||||||
event. Normally, CR is not used except by the server.
|
|
||||||
Normal Text does not support UTF8 characters. In normal
|
|
||||||
text, the 8th bit is used as a highlight bit. If the client
|
|
||||||
has the utf8 nation flag set, the standout bit is removed
|
|
||||||
and the highlight block is prefixed with SO (ASCII standout)
|
|
||||||
and suffixed with SI (ASCII standin).
|
|
||||||
|
|
||||||
2. Message Text
|
The restriction to the archaic ASCII character set bothered some
|
||||||
For message text, the following ASCII characters are valid:
|
players. It is barely serviceable for most western languages other
|
||||||
Tab, CR, LF and 0x20-0x7e. Normally, LF is a termination
|
than English, and useless for everything else. This is unbecoming for
|
||||||
action event. Normally, CR is not used except by the server.
|
a game played around the world.
|
||||||
Message text also supports UTF8 characters if the utf8 nation
|
|
||||||
flag is turned on; otherwise only the ASCII characters are
|
|
||||||
supported.
|
|
||||||
|
|
||||||
|
|
||||||
Notes:
|
What is Unicode?
|
||||||
|
----------------
|
||||||
|
|
||||||
1. Strings that are considered message text are commented.
|
Unicode is the emerging standard character set for all multi-lingual
|
||||||
|
applications. The core of Unicode is identical to the Universal
|
||||||
|
Character Set (UCS) defined in ISO 10646.
|
||||||
|
|
||||||
2. Both Normal and Message text are char strings in the server.
|
UCS is 31-bit. The most commonly used characters are in the range
|
||||||
Care needs to be taken as some compilers consider char
|
0-0xFFFD, the so-called Basic Multilingual Plane (BMP).
|
||||||
signed and others default to unsigned char.
|
|
||||||
|
|
||||||
3. Unicode functions are prefixed with u.
|
A character set can be encoded in different ways. Popular encodings
|
||||||
|
are UCS-4 (four byte wide characters), UCS-2 (two byte wide
|
||||||
|
characters; can't represent characters outside the BMP directly), and
|
||||||
|
UTF-8 (a multibyte encoding).
|
||||||
|
|
||||||
Notes for Client Implementors:
|
UTF-8 has a few desirable properties. In particular, it is a
|
||||||
|
compatible extension of plain (7-bit) ASCII: every ASCII string is
|
||||||
|
also a valid UTF-8 string, and a plain ASCII byte (an octet with the
|
||||||
|
most significant bit clear) in a UTF-8 string always encodes the
|
||||||
|
ASCII character, i.e. it is never part of a multibyte sequence.
|
||||||
|
|
||||||
ASCII Mode
|
To learn more, see the Unicode FAQ, currently at
|
||||||
|
http://www.cl.cam.ac.uk/~mgk25/unicode.html
|
||||||
|
|
||||||
1. If you do not specify any login options, the server will start the
|
|
||||||
session in ASCII mode.
|
|
||||||
|
|
||||||
2. This is close to the previous mode (<4.2.21) but there is more filtering
|
Requirements for Unicode Support in Empire
|
||||||
to remove non-ASCII characters and ASCII control characters.
|
------------------------------------------
|
||||||
|
|
||||||
3. If another client in UTF8 mode tries to send to this client then the
|
* Full backward compatibility to existing clients
|
||||||
server will replace the non-ASCII characters with question marks.
|
|
||||||
|
|
||||||
4. The standout works the same as before where the 8th bit indicates that
|
* Easy to support for clients
|
||||||
the character should be highlighted.
|
|
||||||
|
|
||||||
UTF8 Mode
|
* Minimal impact on server code; no additional portability headaches
|
||||||
|
|
||||||
1. The login options must be specified before the play command is sent.
|
* Interoperability between old and new clients and servers
|
||||||
The syntax is 'options utf-8'.
|
|
||||||
|
|
||||||
2. The server will filter ASCII control characters but will pass any characters
|
|
||||||
with the 8 bit set.
|
|
||||||
|
|
||||||
3. For the standout mode, the server inserts an ASCII SO character at the
|
Principles of Design
|
||||||
beginning of standout sequence and the server sends an ASCII SI character at
|
--------------------
|
||||||
the end of the standout sequence.
|
|
||||||
Client/server communication uses what we call external encoding:
either traditional Empire ASCII (7-bit ASCII plus highlighting bit) or
UTF-8. The choice between the encodings is under the control of the
client, and defaults to Empire ASCII. The chosen encoding is a
property of the session; it doesn't carry over to future sessions.
Highlighting is only supported for output (from server to client).
Highlighting in UTF-8 is done with control characters: ASCII SO (Shift
Out, C-n, decimal 14) starts highlighting, and ASCII SI (Shift In,
C-o, decimal 15) stops highlighting. Text encoded in the client's
external encoding is called user text.

There are two internal encodings. We use UTF-8 for player-player
communication, and Empire ASCII for everything else. Most of the
time, there's no difference, because ASCII is valid UTF-8. The
exception is where the highlighting bit can be used. We call such
text normal text.

Input from the client needs to be translated from the client's
external encoding into internal encoding. We call this input
filtering. Since highlighting is not supported on input, the result
is always valid UTF-8. Commands retrieve input that is player-player
communication directly as UTF-8. Other input is retrieved as ASCII,
which replaces non-ASCII characters by '?'[1].

Input filtering from UTF-8 drops ASCII control characters except
'\t' and '\n'.

Input filtering from ASCII additionally replaces non-ASCII characters
by '?'. The result is plain ASCII, which is also valid UTF-8.
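
The two input filters described above could look roughly like the
following C sketch. It is an illustration only, not the code in
src/lib/subs/pr.c: the function names filter_input_utf8() and
filter_input_ascii() and the helper is_ascii_control() are
hypothetical, DEL is assumed to count as a control character, and the
sketch does not check that the input is well-formed UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: true for ASCII control characters (C0 and DEL). */
static int is_ascii_control(unsigned char c)
{
    return c < 0x20 || c == 0x7f;
}

/*
 * Filter input of a UTF-8 session in place (hypothetical name).
 * Drops ASCII control characters except '\t' and '\n'; bytes with
 * the most significant bit set pass through unchanged.
 * Returns the length of the filtered string.
 */
size_t filter_input_utf8(char *buf)
{
    char *dst = buf;
    const char *src;

    for (src = buf; *src; src++) {
        unsigned char c = (unsigned char)*src;
        if (is_ascii_control(c) && c != '\t' && c != '\n')
            continue;           /* drop the control character */
        *dst++ = *src;
    }
    *dst = 0;
    return (size_t)(dst - buf);
}

/*
 * Filter input of an ASCII session in place (hypothetical name).
 * Like filter_input_utf8(), but additionally replaces every byte
 * with the most significant bit set by '?', so the result is
 * plain ASCII.
 */
size_t filter_input_ascii(char *buf)
{
    char *dst = buf;
    const char *src;

    for (src = buf; *src; src++) {
        unsigned char c = (unsigned char)*src;
        if (is_ascii_control(c) && c != '\t' && c != '\n')
            continue;
        *dst++ = c >= 0x80 ? '?' : *src;
    }
    *dst = 0;
    return (size_t)(dst - buf);
}
```

Casting to unsigned char before comparing avoids surprises on
platforms where plain char is signed.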

Output to the client needs to be translated from internal encoding to
the client's external encoding. We call this output filtering. It is
integrated into the printing functions, i.e. the functions for sending
output to the client. Most of them accept normal text. Some accept
UTF-8, and some only plain ASCII; all of these are clearly documented.

Output filtering to ASCII doesn't change normal text. In UTF-8 text,
it replaces non-ASCII characters by '?'.

Output filtering to UTF-8 doesn't change UTF-8 text. In normal text,
it strips highlighting bits and inserts SO/SI control characters in
their place.
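
As a rough illustration of the two non-trivial output filters, here is
a C sketch. The names normal_to_utf8() and utf8_to_ascii() are
hypothetical and do not come from the server source. The sketch
assumes that one '?' is emitted per non-ASCII character (not per
byte), that the UTF-8 input is well-formed, and that the caller
provides a large enough destination buffer.

```c
#include <stddef.h>

#define ASCII_SO 14     /* Shift Out: start highlighting */
#define ASCII_SI 15     /* Shift In: stop highlighting */

/*
 * Translate normal text (ASCII plus highlighting bit) for a UTF-8
 * session (hypothetical name).  Strips the highlighting bit and
 * brackets each highlighted run with SO ... SI.
 * dst must have room for 3 * strlen(src) + 1 bytes.
 * Assumes no byte in src is 0x80 (a highlighted NUL).
 * Returns the length of the result.
 */
size_t normal_to_utf8(char *dst, const char *src)
{
    size_t n = 0;
    int highlighted = 0;

    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;
        int want = (c & 0x80) != 0;
        if (want != highlighted) {
            dst[n++] = want ? ASCII_SO : ASCII_SI;
            highlighted = want;
        }
        dst[n++] = (char)(c & 0x7f);
    }
    if (highlighted)
        dst[n++] = ASCII_SI;    /* close a run that reaches the end */
    dst[n] = 0;
    return n;
}

/*
 * Translate UTF-8 text for an ASCII session (hypothetical name).
 * Each non-ASCII character becomes a single '?': the lead byte is
 * replaced, continuation bytes (10xxxxxx) are skipped.
 * dst must have room for strlen(src) + 1 bytes.
 * Returns the length of the result.
 */
size_t utf8_to_ascii(char *dst, const char *src)
{
    size_t n = 0;

    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;
        if (c < 0x80)
            dst[n++] = (char)c;
        else if ((c & 0xc0) != 0x80)
            dst[n++] = '?';
    }
    dst[n] = 0;
    return n;
}
```

The other two combinations (normal text to an ASCII session, UTF-8
text to a UTF-8 session) pass the text through unchanged, so they need
no code here.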


Notes for Clients
-----------------

Clients use session option utf-8 during login to switch the session to
UTF-8. Highlighting is done differently in UTF-8 sessions. Consult
doc/clients-howto for details.

An ASCII session should work just like previous server versions,
except for the treatment of control and non-ASCII characters. We
believe the new behavior makes more sense.

How to program your host to let you use Unicode in your client's user
interface is platform dependent, and beyond the scope of this
document.

Wolfpack's empclient supports UTF-8 if it runs in a terminal that
understands UTF-8. See its manual page.


Implementation Notes
--------------------

A session uses UTF-8 rather than Empire ASCII if PF_UTF8 is set in
member flags of struct player. Session option utf-8 manipulates this
flag.

Input and output filtering code is in src/lib/subs/pr.c.

Almost all code is untouched, almost all strings are still normal
text. Use of the other encodings is commented (well, we tried!).

The following commands and features have been changed to cope with
Unicode:

* telegram, announce, pray and turn

  Get the text of the message as UTF-8. Actually, the commands
  themselves weren't affected, only the common code to get a
  telex-like message, getele().

* read, wire and the automatic display of MOTD and game-down messages

  Display messages as UTF-8. The changes were limited to common
  message display code. Its entry point was renamed from prnf() to
  uprnf(), to make its unusual UTF-8 argument a bit more obvious.

* flash and wall

  Retrieve the text to send as UTF-8, and send it to the recipient(s)
  with output filtering appropriate for their session applied.

  The text to send can be given on the command line itself. In this
  case it has to be fetched from the raw command line buffer. Since
  that buffer is already UTF-8, no change was required.

  The commands can also prompt for the text to send. A new function
  ugetstring() patterned after existing getstring() takes care of
  that.

  Output filtering is handled by pr_flash(). However, flash and wall
  break long lines, and that required some changes for UTF-8.
  Breaking long lines there is probably a bad idea.
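
To see why breaking long lines needs care with UTF-8, consider the
following C sketch. It is not the code used by flash and wall; the
function name utf8_prefix_bytes() is hypothetical, and the input is
assumed to be well-formed UTF-8. It counts characters rather than
bytes, so a break position never falls inside a multibyte sequence.

```c
#include <stddef.h>

/*
 * Return the number of bytes taken up by the first max_chars
 * characters of the UTF-8 string s, or by all of s if it is shorter
 * (hypothetical helper).  A byte starts a new character unless it is
 * a continuation byte (10xxxxxx), so this also works for plain ASCII.
 */
size_t utf8_prefix_bytes(const char *s, size_t max_chars)
{
    size_t i = 0, chars = 0;

    while (s[i]) {
        if (((unsigned char)s[i] & 0xc0) != 0x80) {
            /* s[i] starts a new character */
            if (chars == max_chars)
                break;
            chars++;
        }
        i++;
    }
    return i;
}
```

A caller would output the first utf8_prefix_bytes(line, width) bytes
as one line and continue with the rest of the string.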