2 Codesets and Codeset Conversion

The operating system fully supports the following Chinese codesets by including locales and codeset conversion support:

DEC Hanyu (Section 2.1)

Taiwanese EUC, Extended UNIX Code (Section 2.2)

Big-5 (Section 2.3)

DEC Hanzi (Section 2.4)

GBK (Section 2.5)

GB18030 (Section 2.6)

UTF-8 (Section 2.10)

The operating system also provides conversion support for the following codesets:

Telecode (Section 2.8)

Shift Big-5 (Section 2.7)

UTF-16 (Section 2.9)

UCS-4 (Section 2.9)

HKSCS (Section 2.11.4)

2.1 DEC Hanyu

The DEC Hanyu codeset, denoted by dechanyu, consists of the following character sets:

ASCII

CNS 11643, the first and second character planes

DTSCS

User-Defined Characters

DEC Hanyu uses a combination of single-byte, 2-byte, and 4-byte data to represent ASCII characters, symbols, or ideographic characters.

2.1.1 ASCII Code

Every ASCII character is represented in DEC Hanyu as single-byte, 7-bit data. That is, the Most Significant Bit (MSB) of an ASCII character is always set off.

2.1.2 CNS 11643 Code

Each CNS 11643 character is represented by a 2-byte code in DEC Hanyu, which complies with the CNS 11643 standard. The MSB of the first byte is always set on while that of the second byte can be on for the first character plane or off for the second character plane. See Figure 2-1.

Figure 2-1: DEC Hanyu Encoding of CNS 11643 Planes

The first byte of a CNS 11643 code determines the row number of the character, while the second byte determines its column number. Table 2-1 illustrates the code range of a CNS 11643 code.

Table 2-1: CNS 11643 Code Range in DEC Hanyu

Character Plane	1st Byte (hexadecimal)	2nd Byte (hexadecimal)
Plane 1	A1 to FE	A1 to FE
Plane 2	A1 to FE	21 to 7E

The following formulas illustrate the code of a CNS 11643 character in relation to its row and column numbers.

CNS 11643 Plane 1 character:

First byte = A0 + row number
Second byte = A0 + column number

CNS 11643 Plane 2 character:

First byte = A0 + row number
Second byte = 20 + column number

For example, if a character is positioned at the first column of the 36th row on CNS 11643 Plane 1, its encoding value is calculated as follows:

First byte = A0 (hex) + 36 = C4 (hex)
Second byte = A0 (hex) + 01 = A1 (hex)

Its encoded value is C4A1.

Similarly, if a character is positioned at the first column of the 36th row on CNS 11643 Plane 2, its encoding value is calculated as follows:

First byte = A0 (hex) + 36 = C4 (hex)
Second byte = 20 (hex) + 01 = 21 (hex)

Its encoded value is C421.
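
The two formulas can be checked with a short Python sketch. The function name dechanyu_cns_code is hypothetical and exists only for this illustration, and the sketch assumes that row and column numbers are decimal values from 1 to 94, which is what the code ranges in Table 2-1 imply.

```python
def dechanyu_cns_code(plane: int, row: int, column: int) -> bytes:
    """Return the 2-byte DEC Hanyu code for a CNS 11643 position."""
    if plane not in (1, 2):
        raise ValueError("DEC Hanyu encodes only CNS 11643 planes 1 and 2")
    if not (1 <= row <= 94 and 1 <= column <= 94):
        raise ValueError("row and column must be in the range 1 to 94")
    first = 0xA0 + row
    # Plane 1 sets the MSB of the second byte on; plane 2 sets it off.
    second = (0xA0 if plane == 1 else 0x20) + column
    return bytes([first, second])


print(dechanyu_cns_code(1, 36, 1).hex().upper())  # C4A1
print(dechanyu_cns_code(2, 36, 1).hex().upper())  # C421
```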

Figure 2-2 illustrates the division of a 2-byte code space and the position of CNS 11643 characters.

Figure 2-2: Code Space for CNS 11643 in DEC Hanyu

2.1.3 DTSCS Code

Each DTSCS character is represented by a 4-byte code in DEC Hanyu. The first two bytes are the leading codes, 0xC2 and 0xCB, which serve as the designator sequence for the DTSCS character set. The MSB of the third and fourth bytes is set on for the EDPC Recommended Character Set. See Figure 2-3.

Figure 2-3: DEC Hanyu Encoding of DTSCS Characters

Figure 2-4 illustrates the 4-byte code space and the position of DTSCS characters.

Figure 2-4: Code Space For DTSCS In DEC Hanyu
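
The mix of 1-byte, 2-byte, and 4-byte codes can be illustrated by a small scanner that splits a DEC Hanyu byte string into characters. This is a sketch only. The function name split_dechanyu is hypothetical, and the sketch assumes that the pair 0xC2 0xCB at a character boundary always introduces a 4-byte DTSCS code. It does not validate the byte ranges of each character.

```python
DTSCS_LEAD = b"\xC2\xCB"  # designator sequence named in this section


def split_dechanyu(data: bytes) -> list[bytes]:
    """Split a DEC Hanyu byte string into per-character byte sequences."""
    chars = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            size = 1                      # ASCII: MSB set off
        elif data[i:i + 2] == DTSCS_LEAD:
            size = 4                      # leading codes + 2 more bytes
        else:
            size = 2                      # CNS 11643 or UDC
        if i + size > len(data):
            raise ValueError(f"truncated character at offset {i}")
        chars.append(data[i:i + size])
        i += size
    return chars


sample = b"A\xC4\xA1\xC2\xCB\xA1\xA1\xC4\x21"
print([c.hex().upper() for c in split_dechanyu(sample)])
# ['41', 'C4A1', 'C2CBA1A1', 'C421']
```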

2.1.4 User Defined Characters

In addition to the CNS 11643 and DTSCS character sets described above, DEC Hanyu provides 3,587 positions for User Defined Characters (UDCs). The UDC positions are the unused (but not reserved) code points on the first and second character planes of CNS 11643. Therefore, UDCs are encoded in exactly the same way as CNS 11643 characters but occupy different regions, as shown in Table 2-2.

Table 2-2: UDC Code Range in DEC Hanyu

Character Plane	Number of UDC	Code Range
Plane 1	145	FDCC - FEFE
Plane 1	2,256	AAA1 - C1FE
Plane 2	1,186	F245 - FE7E
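
The counts in Table 2-2 can be reproduced by treating each code range as a run of consecutive code points in row-major order, with 94 second-byte values per row. The following Python sketch makes that assumption explicit. The helper name span_size is hypothetical.

```python
def span_size(start: int, end: int, low: int, high: int) -> int:
    """Count code points from start to end, inclusive, in row-major order.

    Each row (first byte) holds the second-byte values low..high.
    """
    per_row = high - low + 1
    start_row, start_col = start >> 8, start & 0xFF
    end_row, end_col = end >> 8, end & 0xFF
    return (end_row - start_row) * per_row + (end_col - start_col) + 1


udc_ranges = [
    (0xFDCC, 0xFEFE, 0xA1, 0xFE),  # Plane 1: second byte A1 to FE
    (0xAAA1, 0xC1FE, 0xA1, 0xFE),  # Plane 1: second byte A1 to FE
    (0xF245, 0xFE7E, 0x21, 0x7E),  # Plane 2: second byte 21 to 7E
]
sizes = [span_size(*r) for r in udc_ranges]
print(sizes, sum(sizes))  # [145, 2256, 1186] 3587
```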

2.2 Taiwanese EUC

Taiwanese EUC (Extended UNIX Code), denoted as eucTW, is another codeset that supports CNS 11643. The design of Taiwanese EUC allows the 16 character planes of CNS 11643 to be encoded in a unified way. A stream of data encoded in Taiwanese EUC can contain characters defined in ASCII and in the 16 character planes. Figure 2-5 illustrates the encoding of Taiwanese EUC.

Figure 2-5: Encoding of Taiwanese EUC

Taiwanese EUC uses the Single-Shift 2 control character (SS2) and an additional byte to specify a character plane. The only exception is the first plane, which does not require leading codes. Instead, two bytes specify a character's position on the first plane. The first byte determines its row number, while the second determines its column number. The MSBs of the two bytes are set on.
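
The following Python sketch shows how a CNS 11643 position might be turned into a Taiwanese EUC byte sequence. It relies on two details that this section does not state and that are taken from the usual EUC conventions: SS2 has the value 0x8E, and the plane is selected by the byte 0xA0 plus the plane number. The function name euctw_code is hypothetical.

```python
SS2 = 0x8E  # Single-Shift 2 control character (assumed value)


def euctw_code(plane: int, row: int, column: int) -> bytes:
    """Return a Taiwanese EUC byte sequence for a CNS 11643 position."""
    if not 1 <= plane <= 16:
        raise ValueError("plane must be in the range 1 to 16")
    if not (1 <= row <= 94 and 1 <= column <= 94):
        raise ValueError("row and column must be in the range 1 to 94")
    # Row and column bytes both have their MSB set on.
    position = bytes([0xA0 + row, 0xA0 + column])
    if plane == 1:
        return position                  # plane 1 needs no leading codes
    # Assumed plane selector: 0xA0 + plane number.
    return bytes([SS2, 0xA0 + plane]) + position


print(euctw_code(1, 36, 1).hex().upper())  # C4A1
print(euctw_code(2, 36, 1).hex().upper())  # 8EA2C4A1
```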

In this release, Taiwanese EUC supports only the characters defined in the first and second planes of CNS 11643 and those characters in the EDPC Recommended Character Set that have been remapped into the third and fourth character planes of the revised CNS 11643-1992 standard. Other characters that were added to the CNS 11643-1992 standard are not supported.

2.3 Big-5

The Big-5 codeset, denoted as big5, is the only codeset that supports the Big-5 character set. The encoding of the Big-5 codeset is similar to that of CNS 11643 in DEC Hanyu. Each Big-5 character is represented by a 2-byte code, which complies with the Big-5 standard. The MSB of the first byte is always set on, while that of the second byte can be set on or off.

The Big-5 code range is defined as shown in Table 2-3.

Table 2-3: Big-5 Code Range

Character	Number of Characters	Code Range
Special symbols	408	A140-A3BF
Level 1 characters	5,401	A440-C67E
Level 2 characters	7,652	C940-F9D5

The operating system supports codeset conversion for HKSCS (the Hong Kong Supplementary Character Set) and uses Big-5 encoding for HKSCS representation. HKSCS characters map to Big-5 in the range of 8840 to FEFE. See Section 2.11.4 for more information on HKSCS codeset conversion.

In addition to the code points for special symbols and Chinese characters shown in Table 2-3, three areas are set aside for user defined spaces. Some vendors in Taiwan support user defined characters in the code ranges shown in Table 2-4.

Table 2-4: Big-5 User Defined Spaces

Character	Number of Characters	Code Range
Level 1 user defined space	785	FA40-FEFE
Level 2 user defined space	2,983	8E40-A0FE
Level 3 user defined space	2,041	8140-8DFE

The valid ranges of the two bytes are:

Byte	Valid Ranges
First byte	81-FE
Second byte	40-7E and A1-FE
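
The valid byte ranges can be expressed as a simple predicate. The Python sketch below uses that predicate to count the valid codes in each user defined space and reproduces the figures in Table 2-4. The function names are hypothetical.

```python
def is_big5_pair(first: int, second: int) -> bool:
    """Return True if the two bytes fall in the valid Big-5 ranges."""
    return 0x81 <= first <= 0xFE and (
        0x40 <= second <= 0x7E or 0xA1 <= second <= 0xFE
    )


def big5_span_size(start: int, end: int) -> int:
    """Count the valid 2-byte codes from start to end, inclusive."""
    return sum(
        1 for code in range(start, end + 1)
        if is_big5_pair(code >> 8, code & 0xFF)
    )


print(big5_span_size(0xFA40, 0xFEFE))  # 785
print(big5_span_size(0x8E40, 0xA0FE))  # 2983
print(big5_span_size(0x8140, 0x8DFE))  # 2041
```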

Figure 2-6 illustrates the encoding of the Big-5 codeset in a 2-byte code space.

Figure 2-6: Code Space for Big-5

2.4 DEC Hanzi

The ASCII, GB2312-80 and extended GB character sets are combined to form the DEC Hanzi codeset.

DEC Hanzi, denoted as dechanzi, is a codeset for simplified Chinese. It uses a 2-byte data representation for the symbols and ideographic characters defined in the GB2312-80 and extended GB character sets. To differentiate these codes from ASCII codes, the MSB of the first byte is always set on. The MSB of the second byte is set on for GB2312-80 and off for extended GB, as shown in Figure 2-7.

Figure 2-7: DEC Hanzi Character Encoding

The first byte of a two-byte code determines its row number, while the second byte determines its column number.

The following formulas illustrate the code of a GB2312-80 character or an extended GB character in relation to its row and column numbers.

GB2312-80 character:

First byte = A0 + row number
Second byte = A0 + column number

Extended GB character:

First byte = A0 + row number
Second byte = 20 + column number

For example, if a character is positioned at the first column of the 16th row on the GB2312-80 code plane, its encoding value is calculated as follows:

First byte = A0 (hex) + 16 = B0 (hex)
Second byte = A0 (hex) + 01 = A1 (hex)

The resulting encoded value is B0A1.

Similarly, if a character is positioned at the first column of the 16th row on the extended GB code plane, its encoding value is calculated as follows:

First byte = A0 (hex) + 16 = B0 (hex)
Second byte = 20 (hex) + 01 = 21 (hex)

The resulting encoded value is B021.
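
Reversing the formulas recovers the character set, row, and column from a 2-byte DEC Hanzi code. The Python sketch below is illustrative only. The function name dechanzi_position is hypothetical, and the sketch assumes that rows and columns run from 1 to 94.

```python
def dechanzi_position(code: bytes) -> tuple[str, int, int]:
    """Return (character set, row, column) for a 2-byte DEC Hanzi code."""
    first, second = code  # raises ValueError unless exactly two bytes
    if not 0xA1 <= first <= 0xFE:
        raise ValueError("first byte must have its MSB set on (A1 to FE)")
    row = first - 0xA0
    if 0xA1 <= second <= 0xFE:           # MSB on: GB2312-80
        return ("GB2312-80", row, second - 0xA0)
    if 0x21 <= second <= 0x7E:           # MSB off: extended GB
        return ("extended GB", row, second - 0x20)
    raise ValueError("second byte is outside both code ranges")


print(dechanzi_position(bytes.fromhex("B0A1")))  # ('GB2312-80', 16, 1)
print(dechanzi_position(bytes.fromhex("B021")))  # ('extended GB', 16, 1)
```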

Figure 2-8 illustrates the division of a 2-byte code space and the position of the Chinese character sets.

Figure 2-8: GB2312-80 and Extended GB Code Space

2.5 GBK

The GBK codeset is a character encoding system for simplified Chinese.

The codeset provides a total of 23,940 code points, 21,886 of which are assigned. Each row in the GBK code table consists of 190 characters. ASCII characters, which are single-byte characters, are defined in the range 0x21 to 0x7E. Encoding ranges for 2-byte characters are 0x81 to 0xFE for the first byte and 0x40 to 0x7E and 0x80 to 0xFE for the second byte.
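
The code point total follows directly from the byte ranges, as the following Python sketch shows. The second-byte range has a gap at 0x7F, which leaves 190 values per row. The function name is_gbk_pair is hypothetical.

```python
def is_gbk_pair(first: int, second: int) -> bool:
    """Return True if the two bytes fall in the GBK 2-byte ranges."""
    return (
        0x81 <= first <= 0xFE
        and 0x40 <= second <= 0xFE
        and second != 0x7F
    )


per_row = sum(is_gbk_pair(0x81, s) for s in range(256))
total = sum(is_gbk_pair(f, s) for f in range(256) for s in range(256))
print(per_row, total)  # 190 23940
```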

In terms of character-to-code allocation, the sub-range for GB2312-80 characters (0xA1A1-0xFEFE) in GBK is the same encoding range defined for these characters in EUC. GBK is therefore backward compatible with Chinese EUC coding as well as forward compatible with the encoding defined in the ISO 10646 standard.

The GBK codeset is divided into five levels as follows:

Level	Encoding Range	Code Points	Characters
GBK/1	0xA1A1 to 0xA9FE	846	717
GBK/2	0xB0A1 to 0xF7FE	6768	6763
GBK/3	0x8140 to 0xA0FE	6080	6080
GBK/4	0xAA40 to 0xFEA0	8160	8160
GBK/5	0xA840 to 0xA9A0	192	166

In addition, the GBK codeset includes the following codepoints for user-defined characters:

Encoding Range	Code Points
0xAAA1 to 0xAFFE	564
0xF8A1 to 0xFEFE	658
0xA140 to 0xA7A0	672
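
As a consistency check, the code point counts for the five levels and the three user-defined ranges can be recomputed by reading each range as a rectangle of first bytes by second bytes. The second-byte sets used in this Python sketch (0xA1 to 0xFE, or 0x40 to 0xA0 without 0x7F) are an interpretation chosen because it reproduces the counts. The region labels "user 1" to "user 3" are hypothetical.

```python
TRAIL_HIGH = list(range(0xA1, 0xFF))                       # 94 values
TRAIL_LOW = [b for b in range(0x40, 0xA1) if b != 0x7F]    # 96 values
TRAIL_ALL = TRAIL_LOW + TRAIL_HIGH                         # 190 values

# name: (first-byte range, second-byte set)
regions = {
    "GBK/1": (range(0xA1, 0xAA), TRAIL_HIGH),
    "GBK/2": (range(0xB0, 0xF8), TRAIL_HIGH),
    "GBK/3": (range(0x81, 0xA1), TRAIL_ALL),
    "GBK/4": (range(0xAA, 0xFF), TRAIL_LOW),
    "GBK/5": (range(0xA8, 0xAA), TRAIL_LOW),
    "user 1": (range(0xAA, 0xB0), TRAIL_HIGH),
    "user 2": (range(0xF8, 0xFF), TRAIL_HIGH),
    "user 3": (range(0xA1, 0xA8), TRAIL_LOW),
}
counts = {name: len(lead) * len(trail) for name, (lead, trail) in regions.items()}
print(counts)
print(sum(counts.values()))  # 23940, the total number of GBK code points
```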

The operating system provides the following codeset converter pairs for converting simplified Chinese characters between GBK and Unicode formats (UTF-16, UCS-4, and UTF-8):

UTF-16_GBK and GBK_UTF-16
UCS-4_GBK and GBK_UCS-4
UTF-8_GBK and GBK_UTF-8

2.6 GB18030

The GB18030 codeset provides 1-byte, 2-byte, and 4-byte encoding with the following structure:

Number of Bytes	Encoding Range	Code Points
1 byte	0x00 to 0x7f	128
2 byte	First byte 0x81 to 0xfe; second byte 0x40 to 0xfe (except 0x7f)	23940
4 byte	First byte 0x81 to 0xfe; second byte 0x30 to 0x39; third byte 0x81 to 0xfe; fourth byte 0x30 to 0x39	1587600

GB18030 1-byte code supports ASCII characters.

GB18030 2-byte code supports all the CJK characters (Chinese, Japanese, Korean) in the Unicode Version 2.1 Standard.

GB18030 4-byte code supports Unicode Version 3.0 additions. The 4-byte code also leaves a large number of unassigned code points available for future use.
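
Because the second byte of a 4-byte code (0x30 to 0x39) never overlaps the second byte of a 2-byte code (0x40 to 0xfe), the length of a GB18030 character can be determined from its first two bytes. The following Python sketch illustrates this. The function name gb18030_char_length is hypothetical, and the sketch checks byte ranges only, not whether a code point is assigned.

```python
def gb18030_char_length(data: bytes, offset: int = 0) -> int:
    """Return the length in bytes of the GB18030 character at offset."""
    first = data[offset]
    if first <= 0x7F:
        return 1
    if not 0x81 <= first <= 0xFE or offset + 1 >= len(data):
        raise ValueError("invalid or truncated multibyte character")
    second = data[offset + 1]
    if 0x40 <= second <= 0xFE and second != 0x7F:
        return 2
    if 0x30 <= second <= 0x39:
        rest = data[offset + 2:offset + 4]
        if len(rest) == 2 and 0x81 <= rest[0] <= 0xFE and 0x30 <= rest[1] <= 0x39:
            return 4
    raise ValueError("invalid or truncated multibyte character")


print(gb18030_char_length(b"A"))                 # 1
print(gb18030_char_length(b"\xd6\xd0"))          # 2
print(gb18030_char_length(b"\x81\x30\x81\x30"))  # 4
print(126 * 10 * 126 * 10)                       # 1587600 4-byte code points
```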

The operating system provides the following codeset converter pairs for converting simplified Chinese characters between GB18030 and Unicode formats (UTF-16, UCS-4, and UTF-8):

UTF-16_GB18030 and GB18030_UTF-16
UCS-4_GB18030 and GB18030_UCS-4
UTF-8_GB18030 and GB18030_UTF-8

Note
The GB18030-2000 character set maps the invalid Unicode code points (U+FFFE and U+FFFF) to 4-byte codes. Because these two code points are invalid in UCS, this mapping can cause problems with conversions between the two character sets. Also, the GB18030-2000 character set provides no mapping from 4-byte codes to the UCS surrogate area (U+D800 to U+DFFF).

2.7 Shift Big-5

The Shift Big-5 codeset, denoted as sbig5, is a variant of the Big-5 codeset. The difference between the two is that the second byte of some Big-5 characters is mapped to other values to form Shift Big-5 characters. Table 2-5 illustrates the mappings of Big-5 characters to Shift Big-5 characters.

Table 2-5: Big-5 to Shift Big-5 Character Mappings

Big-5 (Second Byte)	Shift Big-5 (Second Byte)
40	30
5B	31
5C	32
5D	33
5E	34
5F	35
60	36
7B	37
7C	38
7D	39
7E	9F
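
The mapping in Table 2-5 is a direct lookup on the second byte. The Python sketch below assumes that second bytes not listed in the table are left unchanged, which is what "some Big-5 characters" suggests. The function names are hypothetical.

```python
# Second-byte mapping from Table 2-5 (Big-5 -> Shift Big-5).
BIG5_TO_SBIG5 = {
    0x40: 0x30, 0x5B: 0x31, 0x5C: 0x32, 0x5D: 0x33, 0x5E: 0x34, 0x5F: 0x35,
    0x60: 0x36, 0x7B: 0x37, 0x7C: 0x38, 0x7D: 0x39, 0x7E: 0x9F,
}
SBIG5_TO_BIG5 = {v: k for k, v in BIG5_TO_SBIG5.items()}


def big5_to_sbig5(char: bytes) -> bytes:
    """Convert one 2-byte Big-5 character to Shift Big-5."""
    first, second = char
    return bytes([first, BIG5_TO_SBIG5.get(second, second)])


def sbig5_to_big5(char: bytes) -> bytes:
    """Convert one 2-byte Shift Big-5 character to Big-5."""
    first, second = char
    return bytes([first, SBIG5_TO_BIG5.get(second, second)])


print(big5_to_sbig5(b"\xa4\x40").hex().upper())  # A430
print(big5_to_sbig5(b"\xa4\x7e").hex().upper())  # A49F
```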

The Shift Big-5 codeset can be used in codeset conversion and terminal display. See Section 2.11 for details.

2.8 Telecode

The Telecode codeset (called Mitac Telex in earlier versions of the operating system), denoted as telecode, consists of 2 character planes. Each character plane has 8836 character positions. In Plane 1, standard characters occupy positions 0001 to 8045; the remaining 791 positions are for user-defined characters. In Plane 2, standard characters occupy positions 0001 to 8489; the remaining 346 positions are for user-defined characters. Telecode uses 2-byte values to represent characters on both planes.

Note
For information about the character sets encoded by Telecode, see the Chinese Code For Data Communication.

Telecode can be used in codeset conversion and terminal display. See Section 2.11 for further details.

2.8.1 Plane 1 Character Encoding

To differentiate Plane 1 code from Plane 2 code, the MSB is set on in both bytes of a Plane 1 character code. You can use the following formula to calculate the value of a Plane 1 character from its position on the plane:

First byte = M + 161

Second byte = N + 161 - M x 94

In this formula, N is the position of the character on the plane and M is the integer part of N / 94.

For example, if a character is at position 2502 on Plane 1, its encoded value is BBDB, which is calculated as follows:

N = 2502, M = integer part of 2502/94 = 26

First byte = 26 + 161 = 187 (or, BB (hex))

Second byte = 2502 + 161 - 26 x 94 = 219 (or, DB (hex))
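
The Plane 1 formula translates directly into integer arithmetic. In the Python sketch below, the function name telecode_plane1 is hypothetical, and the position check simply follows the plane size of 8836 positions given in Section 2.8.

```python
def telecode_plane1(position: int) -> bytes:
    """Return the 2-byte Telecode value for a Plane 1 position."""
    if not 1 <= position <= 8836:
        raise ValueError("position must be in the range 1 to 8836")
    m = position // 94                   # integer part of N / 94
    first = m + 161
    second = position + 161 - m * 94     # MSB set on in both bytes
    return bytes([first, second])


print(telecode_plane1(2502).hex().upper())  # BBDB
```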

2.8.2 Plane 2 Character Encoding

To differentiate Plane 2 code from Plane 1 code, the MSB of the first byte is set on and the MSB of the second byte is set off in each Plane 2 character code. You can use the following formula to calculate the value of a Plane 2 character from its position:

First byte = M + 161

Second byte = N + 33 - M x 94

In this formula, N is the position of the character on the plane and M is the integer part of N / 94.

For example, if a character is at position 2502 on Plane 2, its encoded value is BB5B, which is calculated as follows:

N = 2502, M = integer part of 2502/94 = 26

First byte = 26 + 161 = 187 (or, BB (hex))

Second byte = 2502 + 33 - 26 x 94 = 91 (or, 5B (hex))
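
Going the other way, the plane and position can be recovered from a 2-byte Telecode value by inspecting the MSB of the second byte and inverting the formulas. This Python sketch is illustrative only. The function name telecode_position is hypothetical.

```python
def telecode_position(code: bytes) -> tuple[int, int]:
    """Return (plane, position) for a 2-byte Telecode value."""
    first, second = code  # raises ValueError unless exactly two bytes
    if first < 161:
        raise ValueError("the MSB of the first byte must be set on")
    m = first - 161
    if second & 0x80:                    # MSB on: Plane 1
        return (1, m * 94 + second - 161)
    return (2, m * 94 + second - 33)     # MSB off: Plane 2


print(telecode_position(bytes.fromhex("BBDB")))  # (1, 2502)
print(telecode_position(bytes.fromhex("BB5B")))  # (2, 2502)
```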

2.9 UCS-4/UTF-16

The UCS codeset is a standard character encoding for the universal character set (UCS) specified in Unicode and ISO/IEC 10646. There are two encoding schemes for UCS. An implementation that parses in 16-bit units (2 octets) is known as UTF-16. This is the canonical Unicode encoding in wide use on personal computers. An implementation that parses in 32-bit units (4 octets) is known as UCS-4. This is the canonical ISO/IEC 10646 encoding that is in use on systems that can support larger data unit size.
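
The difference between the two schemes is easy to see by encoding the same characters both ways. The Python sketch below uses the standard utf-16-be and utf-32-be codecs as stand-ins for UTF-16 and UCS-4. The big-endian byte order is chosen only to make the output readable and is not a statement about the byte order used by the operating system.

```python
for ch in ("A", "\u4e2d", "\U00020000"):
    utf16 = ch.encode("utf-16-be")   # 16-bit units; 2 units above U+FFFF
    ucs4 = ch.encode("utf-32-be")    # always one 32-bit unit
    print(f"U+{ord(ch):04X}", utf16.hex().upper(), ucs4.hex().upper())

# U+0041 0041 00000041
# U+4E2D 4E2D 00004E2D
# U+20000 D840DC00 00020000
```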

On Tru64 UNIX, UTF-16 and UCS-4 can be used in codeset conversion. In addition, UCS-4 is used as internal process code for some locales. For codeset conversion, see Section 2.11. For locale variants, see Chapter 3.

2.10 UTF-8

The Unicode and ISO/IEC 10646 standards define transformation formats for the UCS. The following UCS transformation formats (UTFs) exist mainly to transform UCS values into sequences of bytes for handling by various byte-oriented protocols:

UTF-8 is the standard method for transforming UCS-encoded data into a sequence of 8-bit bytes and ensuring interchange transparency for characters in C0 code positions (0 to 31), the SPACE (32) character, and the DEL (127) character.

UTF-7 is the standard interchange format for environments that strip the eighth bit from each byte.

UTF-16 is a transformation format that allows systems that are limited to processing of 16-bit units to support the extended character definition space that is included in UCS-4.

The operating system supports UTF-8 and UTF-16. UTF-8 can be used in codeset conversion and in locales. For codeset conversion, see Section 2.11. For locale variants, see Chapter 3.
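
The UTF-8 transformation itself is a fixed bit-packing rule. The following Python sketch implements it by hand for illustration. The function name utf8_encode is hypothetical, and in practice the built-in str.encode("utf-8") does the same job. Code points below 0x80, which include the C0 controls, SPACE, and DEL, pass through as single unchanged bytes.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one UCS code point as a UTF-8 byte sequence."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable UCS code point")
    if code_point < 0x80:                # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:               # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:             # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),   # 4 bytes
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_encode(0x41).hex().upper())    # 41
print(utf8_encode(0x4E2D).hex().upper())  # E4B8AD
```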

2.11 Codeset Conversion

You may sometimes need to convert files from one codeset to another. Use the iconv utility to convert the encoding of characters in one codeset to another and write the results to standard output. Table 2-6 shows the pairs of Chinese codeset converters that are provided.

Table 2-6: Chinese Codeset Conversion

Conversion from Big-5 to	Conversion from Taiwan EUC to
DEC Hanyu	DEC Hanyu
Taiwan EUC	Big-5
Shift Big-5	Shift Big-5
Telecode	Telecode
DEC Hanzi	DEC Hanzi
UCS-4	UCS-4
UTF-16	UTF-16
UTF-8	UTF-8
Conversion from DEC Hanyu to	Conversion from DEC Hanzi to
Taiwan EUC	DEC Hanyu
Big-5	Taiwan EUC
Telecode	Shift Big-5
DEC Hanzi	UCS-4
UCS-4	UTF-16
UTF-16	UTF-8
UTF-8	-
Conversion from UTF-8 to	Conversion from UCS-4 to
DEC Hanyu	DEC Hanyu
Taiwan EUC	Taiwan EUC
Big-5	Big-5
DEC Hanzi	DEC Hanzi
HKSCS	HKSCS
UCS-4	UTF-16
UTF-16	UTF-8
Conversion from GBK to	Conversion from GB18030 to
UCS-4	UCS-4
UTF-16	UTF-16
UTF-8	UTF-8
Conversion from Telecode to	Conversion from UTF-16 to
DEC Hanyu	DEC Hanyu
Taiwan EUC	Taiwan EUC
Big-5	Big-5
-	HKSCS
-	UCS-4
-	UTF-8
Conversion from Shift Big-5 to	-
Taiwan EUC	-
Big-5	-

For example, the following command converts a DEC Hanyu file to Big-5:

% iconv -f dechanyu -t big5 <file>

Table 2-7 shows the various string names you can use as the parameters of the iconv utility.

Table 2-7: Codeset Names and Associated Strings

Codeset	String
DEC Hanyu	dechanyu
Taiwanese EUC	eucTW
Big-5	big5
Shift Big-5	sbig5
Telecode	telecode
DEC Hanzi	dechanzi
GBK	GBK
GB18030	GB18030
Hong Kong Supplementary Character Set	HKSCS
Universal Codeset (4-octet form)	UCS-4
Universal Transfer Format (16-bit)	UTF-16
Universal Transfer Format (8-bit)	UTF-8
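
When iconv is driven from a script, the strings in Table 2-7 can be kept in a lookup table. The Python sketch below only builds the argument list and does not run the command. The function name iconv_command and the short dictionary keys are hypothetical, and the sketch does not check whether a converter pair is listed in Table 2-6.

```python
# Codeset names and their iconv strings, from Table 2-7.
ICONV_NAMES = {
    "DEC Hanyu": "dechanyu",
    "Taiwanese EUC": "eucTW",
    "Big-5": "big5",
    "Shift Big-5": "sbig5",
    "Telecode": "telecode",
    "DEC Hanzi": "dechanzi",
    "GBK": "GBK",
    "GB18030": "GB18030",
    "HKSCS": "HKSCS",
    "UCS-4": "UCS-4",
    "UTF-16": "UTF-16",
    "UTF-8": "UTF-8",
}


def iconv_command(source: str, target: str, path: str) -> list[str]:
    """Build an iconv argument list for the given codeset names."""
    return ["iconv", "-f", ICONV_NAMES[source], "-t", ICONV_NAMES[target], path]


print(iconv_command("DEC Hanyu", "Big-5", "input_file"))
# ['iconv', '-f', 'dechanyu', '-t', 'big5', 'input_file']
```

The resulting list can be passed to subprocess.run with stdout redirected to the output file, because iconv writes its results to standard output.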

2.11.1 Default Conversion String

When converting from one codeset to another, characters in the source codeset that have no corresponding code point in the destination codeset are not converted. By default, the characters that cannot be converted are skipped and have no representation in the converted output.

You can control this behavior by using the ICONV_DEFSTR environment variable to define a default string to replace those unconvertible characters. If you specify a numeric value for this environment variable, the corresponding character value will be used.

The ICONV_DEFSTR environment variable affects all Chinese iconv converters. You can also use the ICONV_DEFSTR_<from_code>_<to_code> environment variable to control a specific codeset conversion. For example, to convert a DEC Hanyu input file to DEC Hanzi with each unconvertible character replaced by "?", enter the following commands:

% setenv ICONV_DEFSTR_dechanyu_dechanzi "?"
% iconv -f dechanyu -t dechanzi hanyu_input > hanzi_output

For codeset converters that end in UTF-16, UCS-4, or UTF-8, you can use the "U+XXXX" notation to specify the default character for conversion failure fallback.
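
The difference between skipping and substituting can be modeled with a toy converter. The Python sketch below is not the iconv implementation. The mapping table is a made-up two-entry example rather than a real codeset, and the function name convert is hypothetical.

```python
from typing import Optional


def convert(text: str, table: dict, default: Optional[str] = None) -> str:
    """Map each character through table, handling misses like ICONV_DEFSTR.

    With default=None, unconvertible characters are skipped. Otherwise
    each one is replaced by the default string.
    """
    out = []
    for ch in text:
        if ch in table:
            out.append(table[ch])
        elif default is not None:
            out.append(default)
        # else: skipped, which mirrors the behavior without ICONV_DEFSTR
    return "".join(out)


toy_table = {"A": "a", "B": "b"}        # hypothetical mapping
print(convert("AXB", toy_table))        # ab
print(convert("AXB", toy_table, "?"))   # a?b
```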

Note
During cut-and-paste operations, those traditional Chinese characters that cannot be converted to simplified Chinese characters are shown as default characters in the applications.

2.11.2 One-to-Many Conversion

When converting from the DEC Hanzi codeset to other Chinese codesets, one simplified Chinese character may be mapped to multiple traditional Chinese characters. By default, the iconv utility selects only the most likely candidate from the list of possible choices. You can control this behavior with the ICONV_ACTION environment variable.

The ICONV_ACTION environment variable determines how the iconv utility behaves when there are one-to-many mappings. The possible values are:

batch
The most likely, or preferred, candidate is selected. This is the default. During cut-and-paste operations, the batch mode is always used for one-to-many character mappings.

conv_all
All possible choices are generated within the brackets "{" and "}" so that you can edit the converted file manually and determine which one should be used.

conv_all_nosym
All characters except symbols (for example, punctuation marks) are handled in the same manner as conv_all.
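
The three modes can be compared with a toy model. In the Python sketch below, the candidate table, the choice of preferred candidate, and the exact layout of the choices inside the braces are assumptions made for illustration. Only the use of "{" and "}" comes from the description above. The function name convert is hypothetical.

```python
# Hypothetical one-to-many table; the first entry is treated as the
# preferred candidate. Real converters use much larger tables.
CANDIDATES = {
    "\u53d1": ["\u767c", "\u9aee"],   # simplified character, two traditional forms
    "\u201c": ["\u300c", "\u300e"],   # a symbol with two candidates (illustrative)
}
SYMBOLS = {"\u201c"}
ACTIONS = ("batch", "conv_all", "conv_all_nosym")


def convert(text: str, action: str = "batch") -> str:
    """Convert text, handling one-to-many mappings as ICONV_ACTION describes."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    out = []
    for ch in text:
        choices = CANDIDATES.get(ch, [ch])
        list_all = len(choices) > 1 and (
            action == "conv_all"
            or (action == "conv_all_nosym" and ch not in SYMBOLS)
        )
        out.append("{" + "".join(choices) + "}" if list_all else choices[0])
    return "".join(out)


for action in ACTIONS:
    print(action, convert("\u201c\u53d1", action))
```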

Note
The ICONV_ACTION environment variable applies only to conversions of simplified Chinese to traditional Chinese and has no effect on UCS-4 and UTF-8 converters.

2.11.3 User Defined Character Mappings

Some user defined characters in the Big-5 codeset have predefined mappings to user defined spaces in DEC Hanyu. Table 2-8 shows this mapping.

Table 2-8: Mapping Between Big-5 and DEC Hanyu User Defined Characters

DEC Hanyu	Big-5	Code Size
F321 - FB41	FA40 - FEFE	785
FB42 - FEFE	8E40 - 905C	343
AAA1 - C1FE	905D - 9EB8	2256

These predefined user defined character mappings are supported by both the iconv methods and the terminal driver.

Some user defined characters do not have predefined mappings. You should use only those user defined characters that have predefined mappings.

2.11.4 Hong Kong Supplementary Character Set

HKSCS, the Hong Kong Supplementary Character Set, was developed by the Hong Kong Special Administrative Region Government, in collaboration with the Chinese Language Interface Advisory Committee, to provide Chinese characters needed in computing in Hong Kong. The characters contained in HKSCS are only for computer use and can be represented as either Big-5 or Unicode.

While HKSCS is not a character set name, the operating system uses it as the name for the extended Big-5 encoding that contains the HKSCS characters. Currently, HKSCS support is limited to codeset conversion between HKSCS and Unicode.

See HKSCS.5 for more information.

2.12 Codeset for Peripheral Devices

The operating system provides a mechanism for configuring your system to run applications with peripherals, such as terminals and printers, that support different codesets. You can specify the codesets for the applications, terminals, and printers independently, as shown in Table 2-9. The operating system software automatically does the necessary codeset conversion.

Table 2-9: Feasible Chinese Codesets for Applications, Terminals, and Printers

Application Code	Terminal Code	Printer Code
DEC Hanyu	DEC Hanyu	DEC Hanyu
Taiwanese EUC	Taiwanese EUC	Taiwanese EUC
Big-5	Big-5	Big-5
DEC Hanzi	DEC Hanzi	DEC Hanzi
none	Shift Big-5	none
none	Telecode	none
UTF-8	UTF-8	none

Chinese DECterm software supports only DEC Hanyu, Big-5, or DEC Hanzi as its terminal code. You must activate the stty drive and set tcode to dechanyu when running in a Taiwanese EUC locale. For example:

% stty adec tcode dechanyu

The dxterm terminal emulator does not support UTF-8 as a terminal code. Use dtterm when UTF-8 is required as a terminal code.

For details about setting up codesets for terminals and printers, see Using International Software.