2 Codesets and Codeset Conversion

The Tru64 UNIX operating system fully supports the following Korean codesets by including locales and codeset conversion support:

DEC Korean

Korean EUC (Extended UNIX Code)

UTF-8

It also provides codeset conversion support for the following codesets:

KSC5601 (Unified Hangul)

ISO-2022-KR

UCS-4

2.1 DEC Korean

The ASCII, KSC5636-1993 (KS Roman), and KSC5601-1992 character sets (excluding the additional Hangul characters defined in Annex 3 of the standard) are combined to form the DEC Korean codeset, which is denoted as deckorean.

DEC Korean uses a two-byte data representation for symbols and ideographic characters defined in KSC5601-1992. To differentiate KSC5601-1992 characters from ASCII, the most significant bit (MSB) of both bytes of a KSC5601 character is always set.
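The MSB rule is enough to split a DEC Korean byte stream into characters. The following Python sketch illustrates the idea; the function name `split_deckorean` is hypothetical and is not part of the operating system.

```python
def split_deckorean(data: bytes) -> list:
    """Split a DEC Korean byte string into one-byte and two-byte characters."""
    chars = []
    i = 0
    while i < len(data):
        if data[i] & 0x80:
            # MSB set: this is the first byte of a two-byte KSC5601 character,
            # so the next byte must exist and must also have its MSB set.
            if i + 1 >= len(data) or not (data[i + 1] & 0x80):
                raise ValueError(f"malformed two-byte character at offset {i}")
            chars.append(data[i:i + 2])
            i += 2
        else:
            # MSB clear: a single ASCII byte.
            chars.append(data[i:i + 1])
            i += 1
    return chars


print(split_deckorean(b"A\xc4\xa1B"))
```

The sketch only checks the MSB of each byte; it does not verify that a two-byte value is actually assigned a character in KSC5601-1992.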

Figure 2-1: Representations of DEC Korean Characters

Representations of ASCII and Two-Byte Characters

The first byte of a two-byte code determines its row number, while the second determines its column number. The following formula illustrates the code of a two-byte KSC5601 character in relation to its row and column numbers:

1st byte = A0 + row number

2nd byte = A0 + column number

For example, if a character is at the first column of the 36th row, its encoded value is calculated as follows:

1st byte = A0 (hex) + 36 (decimal; 24 hex) = C4 (hex)

2nd byte = A0 (hex) + 01 = A1 (hex)

In this case, the character code is C4A1.
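The calculation can be expressed as a short Python sketch. The function names are hypothetical, and the limit of 94 rows and 94 columns is an assumption inferred from the 0xA1 through 0xFE byte range used for KSC5601-1992 characters.

```python
def ksc5601_code(row: int, column: int) -> bytes:
    """Return the two-byte code for a KSC5601 row and column (both 1-based)."""
    # Assumption: rows and columns run from 1 to 94 (bytes 0xA1 to 0xFE).
    if not (1 <= row <= 94 and 1 <= column <= 94):
        raise ValueError("row and column must be in the range 1..94")
    return bytes([0xA0 + row, 0xA0 + column])


def ksc5601_position(code: bytes) -> tuple:
    """Inverse of ksc5601_code: return (row, column) for a two-byte code."""
    if len(code) != 2 or not all(0xA1 <= b <= 0xFE for b in code):
        raise ValueError("not a two-byte KSC5601 code")
    return code[0] - 0xA0, code[1] - 0xA0


print(ksc5601_code(36, 1).hex().upper())        # C4A1
print(ksc5601_position(bytes([0xC4, 0xA1])))    # (36, 1)
```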

Figure 2-2 illustrates the division of a two-byte code space and the position of KSC5601-1992 characters.

Figure 2-2: Code Space for KSC5601-1992

2.2 Korean EUC

Extended UNIX Code (EUC) is an encoding methodology that allows concurrent use of up to four code sets in a data stream. Korean EUC uses that method to combine ASCII and KSC5601. Korean EUC is currently identical to DEC Korean, and is denoted as eucKR.

2.3 KSC5601 (Unified Hangul)

Microsoft developed Unified Hangul Code (UHC), also known as "Extended Wansung", for its Windows 95 operating system. It is an optional character set of Win95K. Microsoft calls this encoding Code Page 949.

Unified Hangul provides full compatibility with KSC5601-1992 EUC encoding, but adds encoding ranges to hold additional precombined Hangul characters (more precisely, the 8,822 that are needed to fully support the Johab character set). The following table provides the encoding ranges for UHC encoding:

Two-Byte Standard Characters	Encoding Ranges
First byte range	0x81-0xFE
Second byte ranges	0x41-0x5A, 0x61-0x7A and 0x81-0xFE

One-Byte Characters	Encoding Range
ASCII	0x21-0x7E

Note that the encoding range 0xA1A1 through 0xFEFE is identical, in terms of character-to-code allocation, to KSC5601-1992 in EUC encoding.
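The byte ranges above can be checked with a few comparisons. The following Python sketch uses hypothetical function names; it tests only whether a byte pair falls inside the listed ranges, not whether a character is actually assigned to that code.

```python
# Byte ranges taken from the UHC encoding table above.
UHC_FIRST = [(0x81, 0xFE)]
UHC_SECOND = [(0x41, 0x5A), (0x61, 0x7A), (0x81, 0xFE)]


def in_ranges(value: int, ranges) -> bool:
    """Return True if value lies in any (low, high) inclusive range."""
    return any(low <= value <= high for low, high in ranges)


def is_uhc_two_byte(first: int, second: int) -> bool:
    """True if the byte pair lies in the UHC two-byte encoding ranges."""
    return in_ranges(first, UHC_FIRST) and in_ranges(second, UHC_SECOND)


def is_ksc5601_euc(first: int, second: int) -> bool:
    """True if the pair lies in the KSC5601-1992 EUC region 0xA1A1..0xFEFE."""
    return 0xA1 <= first <= 0xFE and 0xA1 <= second <= 0xFE


for pair in [(0xC4, 0xA1), (0x81, 0x41), (0x81, 0x5B)]:
    print([hex(b) for b in pair], is_uhc_two_byte(*pair), is_ksc5601_euc(*pair))
```

A pair such as 0xC4 0xA1 passes both checks, which reflects the compatibility with KSC5601-1992 EUC encoding; a pair such as 0x81 0x41 lies only in the extended UHC ranges.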

2.4 ISO-2022-KR

The ISO-2022-KR codeset consists of the following character sets:

ASCII

KSC5601-1992

It is assumed that the starting code of the text is ASCII. ASCII and Korean characters are distinguished by use of the shift function. For example, the code SO indicates that the upcoming bytes are Korean characters as defined in KSC5601. To return to ASCII the SI code is used.

Therefore, the escape sequence, shift function and character set used in a text are as follows:

Control Sequence	Character Set
SO	KSC5601-1992
SI	ASCII
ESC $ ) C	Appears once in the beginning of a line before any appearance of SO characters

Currently, the ISO-2022-KR codeset can be used in codeset conversion.
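The shift mechanism can be illustrated by rewriting Korean EUC data as ISO-2022-KR. The following Python sketch is an illustration only, not the converter shipped with the operating system. It assumes that each KSC5601 byte is carried in 7-bit form (MSB cleared) between SO and SI, and it writes the ESC $ ) C sequence once at the start of the output when Korean characters are present. The function name is hypothetical.

```python
SO = b"\x0e"              # shift out: following bytes are KSC5601 characters
SI = b"\x0f"              # shift in: following bytes are ASCII
DESIGNATOR = b"\x1b$)C"   # ESC $ ) C


def euc_to_iso2022kr(data: bytes) -> bytes:
    """Rewrite Korean EUC bytes as ISO-2022-KR (illustrative sketch)."""
    out = bytearray()
    # Emit the escape sequence once, before any SO, if Korean text is present.
    if any(b & 0x80 for b in data):
        out += DESIGNATOR
    shifted = False
    i = 0
    while i < len(data):
        if data[i] & 0x80:                     # two-byte KSC5601 character
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte character")
            if not shifted:
                out += SO
                shifted = True
            out.append(data[i] & 0x7F)         # assumption: 7-bit form
            out.append(data[i + 1] & 0x7F)
            i += 2
        else:                                  # ASCII byte
            if shifted:
                out += SI
                shifted = False
            out.append(data[i])
            i += 1
    if shifted:
        out += SI                              # end the text in ASCII
    return bytes(out)


print(euc_to_iso2022kr(b"A\xc4\xa1B"))
```

Because a newline is an ASCII byte, the sketch always returns to ASCII with SI before a line ends.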

2.5 UCS-4/UTF-16

The universal character set (UCS) is the character set specified in the Unicode and ISO/IEC 10646 standards. There are two encoding schemes for UCS. An implementation that parses in 16-bit units (2-octet units) is known as UTF-16. This is the canonical Unicode encoding in wide use on personal computers. An implementation that parses in 32-bit units (4-octet units) is known as UCS-4. This is the canonical ISO/IEC 10646 encoding that is in use on systems that can support larger data size units.

On Tru64 UNIX, UTF-16 and UCS-4 encoding can be used for codeset conversion. In addition, UCS-4 is used as an internal process code for some locales. For information about codeset conversion, see Section 2.7. For information about locales, see Chapter 3.
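The difference between the two unit sizes can be seen by encoding the same character both ways. The following Python sketch uses Python's `utf-16-be` and `utf-32-be` codecs; treating `utf-32-be` as a stand-in for big-endian UCS-4 is an assumption that holds for code points within the Unicode range. It is not a statement about the byte order used by the Tru64 UNIX converters.

```python
hangul = "\ud55c"                        # Hangul syllable U+D55C

utf16 = hangul.encode("utf-16-be")       # one 16-bit unit
ucs4 = hangul.encode("utf-32-be")        # one 32-bit unit

print(utf16.hex().upper())               # D55C
print(ucs4.hex().upper())                # 0000D55C

# A code point above U+FFFF needs two 16-bit units (a surrogate pair)
# in UTF-16, but still fits in a single 32-bit unit.
supplementary = "\U00010000"
print(supplementary.encode("utf-16-be").hex().upper())   # D800DC00
print(supplementary.encode("utf-32-be").hex().upper())   # 00010000
```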

2.6 UTF-8

Unicode and ISO/IEC 10646 standards define transformation formats for the universal character set. For the most part, the following UCS transformation formats (UTFs) exist to transform UCS values into sequences of bytes to be handled by various byte-oriented protocols:

UTF-8, the standard method for transforming UCS-encoded data into a sequence of 8-bit bytes and ensuring interchange transparency for characters from the ASCII character set (code positions 0 through 127).

UTF-7, the standard interchange format for environments that strip the eighth bit from each byte.

UTF-16, a transformation format that allows systems that can process only 16-bit units to support the extended character definition space that is included in UCS-4.

The operating system supports UTF-8 and UTF-16. UTF-8 can be used in codeset conversion and in locales. For information about codeset conversion, see Section 2.7. For information about locale variants, see Chapter 3.
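To show how UTF-8 turns a UCS value into a byte sequence while leaving ASCII unchanged, the following Python sketch encodes a single code point by hand and compares the result with Python's built-in codec. The function name `utf8_encode` is hypothetical.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one UCS code point as UTF-8 (illustrative sketch)."""
    if 0xD800 <= cp <= 0xDFFF or not (0 <= cp <= 0x10FFFF):
        raise ValueError("not an encodable code point")
    if cp < 0x80:
        # Code positions 0 through 127 are a single, unchanged byte.
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(0x41))                      # b'A'
print(utf8_encode(0xD55C).hex().upper())      # ED959C
print(utf8_encode(0xD55C) == chr(0xD55C).encode("utf-8"))
```

Every byte of a multibyte sequence has its MSB set, so ASCII bytes never appear inside the encoding of a non-ASCII character.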

2.7 Codeset Conversion

The iconv utility provided by Tru64 UNIX converts the encoding of characters in one codeset to another and writes the results to standard output. Korean codeset converters provided are shown in Table 2-1.

Table 2-1: Codeset Conversion

	DEC Korean	Korean EUC	ISO-2022-KR	KSC5601/cp949	UTF-16	UCS-4	UTF-8
DEC Korean	-	Y	N	Y	Y	Y	Y
Korean EUC	Y	-	Y	N	N	N	N
ISO-2022-KR	N	Y	-	Y	N	N	N
KSC5601/cp949	Y	N	Y	-	Y	Y	Y
UTF-16	Y	N	N	Y	-	Y	Y
UCS-4	Y	N	N	Y	Y	-	Y
UTF-8	Y	N	N	Y	Y	Y	-

For example, you can enter the following command to convert a DEC Korean file to a Korean UTF-8 file:

% iconv -f deckorean -t UTF-8 <file>

Table 2-2 shows the codesets and the strings you use as parameters to the iconv utility.

Table 2-2: Codeset Names

Codeset	Parameter String
DEC Korean	deckorean
Korean EUC	eucKR
ISO-2022-KR	ISO-2022-KR, iso-2022-kr
Unified Hangul	KSC5601, cp949
Universal Codeset	UTF-16, UCS-4
Universal Transfer Format	UTF-8
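Tables 2-1 and 2-2 can be combined into a small lookup that checks whether a converter is listed and builds the matching iconv argument list. The following Python sketch assumes that the rows of Table 2-1 are source codesets and the columns are target codesets, which the table does not state explicitly. The function name is hypothetical, and only one parameter string per codeset is used.

```python
CODESETS = ["DEC Korean", "Korean EUC", "ISO-2022-KR", "KSC5601/cp949",
            "UTF-16", "UCS-4", "UTF-8"]

# One string per row of Table 2-1; position n corresponds to CODESETS[n].
# Assumption: row = source codeset, column = target codeset.
CONVERTERS = {
    "DEC Korean":    "-YNYYYY",
    "Korean EUC":    "Y-YNNNN",
    "ISO-2022-KR":   "NY-YNNN",
    "KSC5601/cp949": "YNY-YYY",
    "UTF-16":        "YNNY-YY",
    "UCS-4":         "YNNYY-Y",
    "UTF-8":         "YNNYYY-",
}

# Parameter strings from Table 2-2.
ICONV_NAME = {
    "DEC Korean": "deckorean",
    "Korean EUC": "eucKR",
    "ISO-2022-KR": "ISO-2022-KR",
    "KSC5601/cp949": "KSC5601",
    "UTF-16": "UTF-16",
    "UCS-4": "UCS-4",
    "UTF-8": "UTF-8",
}


def iconv_command(source: str, target: str, path: str) -> list:
    """Return the iconv argument list, or raise if no converter is listed."""
    if CONVERTERS[source][CODESETS.index(target)] != "Y":
        raise ValueError(f"no converter listed from {source} to {target}")
    return ["iconv", "-f", ICONV_NAME[source], "-t", ICONV_NAME[target], path]


print(" ".join(iconv_command("DEC Korean", "UTF-8", "file.txt")))
```

The returned list can be passed to a process-launching call such as `subprocess.run` on a system where the iconv utility and these converters are installed.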

2.8 Codeset for Peripheral Devices

The operating system provides a mechanism by which you can configure your system to run applications with peripherals, such as terminals and printers, that support different codesets. You can specify the codesets for the applications, terminals, and printers independently, as shown in Table 2-3. The operating system software automatically does the necessary codeset conversion.

Table 2-3: Feasible Korean Codeset for Applications, Terminals, and Printers

Application Code	Terminal Code	Printer Code
DEC Korean	DEC Korean	DEC Korean
Korean EUC	Korean EUC	Korean EUC
UTF-8	UTF-8

Note

The dxterm terminal emulator utility does not support UTF-8 as a terminal code. Use the dtterm terminal emulator utility when UTF-8 is required for a terminal code.

For details about setting up terminal code and printer code, see Using International Software.