AU14C04-Codepages and DB2

The document discusses character sets, encodings, and code page conversion as it relates to DB2. It provides details on defining code pages at the operating system, database, and application levels. It also covers where and when code page conversion occurs between a client and server, and potential issues that can arise from conversion.

#IDUG

Code sets, NLS and character conversion vs. DB2

Roland Schock
ARS Computer und Consulting GmbH
Session Code: C04
2014-09-10 | Platform: LUW

Overview

• What are character sets, encoding schemes and code pages?
• Where can I define the code page used?
• What is code page conversion and where does it happen?
• What problems can arise and how can I avoid them?
• Performance considerations

Character Sets

• A character set is a collection of entities or graphical symbols, each with a meaning.
• Examples of character sets are the Latin alphabet, digits, naval signal flags or other symbols:

[Figure: sample glyphs from different character sets: A, B, C, ...; A b c d; CJK and symbol characters]

Character Encoding

• A character encoding or code page is a mapping from the symbols of a character set to bit patterns, which are also referred to as code points.
A → 17, B → 23, C → 42, …
• Typical examples of encodings are ASCII, EBCDIC or Unicode.
• The encoding scheme also defines a serialisation scheme that converts each code point into a sequence of bytes.
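The distinction between a code point and its serialised bytes can be sketched in a few lines of Python. This is only an illustration using Python's built-in codecs (`cp500` is used here as one example of an EBCDIC code page); the helper name `describe` is made up for this sketch and has nothing to do with DB2 itself.

```python
# Code point vs. serialised bytes, using Python's built-in codecs.
# Escapes are used so the example does not depend on the source file encoding.
SAMPLES = ["A", "\u00e4", "\u20ac"]  # A, a-umlaut, euro sign


def describe(ch: str) -> dict:
    """Return the Unicode code point and the byte sequences of two serialisations."""
    return {
        "code_point": ord(ch),
        "utf-8": ch.encode("utf-8"),
        "utf-16-be": ch.encode("utf-16-be"),
    }


for ch in SAMPLES:
    info = describe(ch)
    print(f"U+{info['code_point']:04X}", info["utf-8"].hex(), info["utf-16-be"].hex())

# The same letter is mapped to different byte values by ASCII and EBCDIC.
ascii_a = "A".encode("ascii")   # 0x41
ebcdic_a = "A".encode("cp500")  # 0xC1
print(ascii_a.hex(), ebcdic_a.hex())
```

The letter "A" keeps its identity as a symbol, but which byte value represents it depends entirely on the code page in use.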

ASCII

• Sample of an encoding scheme:
• First version 1963, standardized 1968
• Ordered mapping to 7-bit numbers

Single Byte Char Sets (SBCS)

• Extensions from 7-bit ASCII to 8-bit code pages
• ISO-8859-x: ASCII + special characters for some languages
• ISO-8859-1 (Latin 1): ASCII + Western European characters
• ISO-8859-2 (Latin 2): ASCII + Eastern European characters
• ISO-8859-15: Modified ISO-8859-1 including the Euro symbol (€)
• Platform-specific character sets: Windows ANSI or MacRoman
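As a small illustration of why the choice of 8-bit code page matters, the following Python sketch decodes the same byte under two ISO-8859 variants and shows where the Windows code page 1252 places the Euro symbol. It uses only Python's standard codecs; the variable names are illustrative.

```python
# One byte, two meanings: 0xA4 under ISO-8859-1 and ISO-8859-15.
raw = bytes([0xA4])

latin1 = raw.decode("iso-8859-1")    # U+00A4, generic currency sign
latin9 = raw.decode("iso-8859-15")   # U+20AC, euro sign

# Windows code page 1252 puts the euro sign at a different byte value.
euro_cp1252 = "\u20ac".encode("cp1252")

# ISO-8859-1 has no euro sign at all, so a strict conversion fails.
try:
    "\u20ac".encode("iso-8859-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

print(f"U+{ord(latin1):04X}", f"U+{ord(latin9):04X}", euro_cp1252.hex(), latin1_has_euro)
```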

Double Byte Char Sets (DBCS)

• Expansion of the SBCS concept from one byte to two bytes per character
• Mainly used for Asian languages with more than 256 characters to encode
• Latin text takes twice the space it needs in an SBCS

EUC (Extended Unix Code)

• Multi Byte Char Set (MBCS): 1 to 4 bytes/char, depending on the code group
• Only used for Japanese, Korean, Traditional and Simplified Chinese on Unix platforms
• Uses single-shift characters to switch to another code group to build a multibyte character
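A short sketch with Python's `euc_jp` codec shows the variable length in practice: ASCII stays one byte, a hiragana character takes two bytes, and a half-width katakana character is introduced by the single-shift byte SS2 (0x8E). This only demonstrates the encoding itself, not how DB2 handles it.

```python
# Byte sequences produced by the EUC-JP codec for three kinds of characters.
ascii_char = "A".encode("euc_jp")       # code group 0: plain ASCII, 1 byte
hiragana = "\u3042".encode("euc_jp")    # HIRAGANA LETTER A, 2 bytes
halfwidth = "\uff71".encode("euc_jp")   # HALFWIDTH KATAKANA LETTER A: SS2 + 1 byte

SS2 = 0x8E  # single-shift 2, switches to the half-width katakana code group

for label, data in (("ascii", ascii_char), ("hiragana", hiragana), ("halfwidth", halfwidth)):
    print(label, len(data), data.hex())
```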

Unicode

• Intended to simplify and unify the different definitions of code pages and hence conversion.
• The first definition provided 65,536 code points (16-bit, 1991, UCS-2).
• Version 2.0 extended the character set by 16 further planes, for up to 1,114,112 code points in total (32-bit, 1996, UCS-4).
• As of Unicode Version 4.0, approx. 100,000 characters are assigned to code points.
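The size of the code space follows from simple arithmetic: the original 16-bit plane plus 16 further planes gives 17 planes of 65,536 code points each. A minimal check in Python (the constant names are illustrative):

```python
import sys

CODE_POINTS_PER_PLANE = 2 ** 16   # 65,536 code points in one plane
PLANES = 1 + 16                   # the original plane plus 16 further planes

total_code_points = CODE_POINTS_PER_PLANE * PLANES
highest_code_point = total_code_points - 1

print(total_code_points, hex(highest_code_point))
# Python exposes the highest code point it supports as sys.maxunicode.
print(sys.maxunicode == highest_code_point)
```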

Unicode char sets and encodings

• UCS-2: two bytes per character
• UCS-4: four bytes per character
• UTF-16: Encoding of UCS-4 into one or two 16-bit words: the first 64k code points use two bytes per character, all others four bytes
• UTF-8: variable-length encoding of characters with one to four bytes
• Possible problems with UCS-2, UCS-4, UTF-16: byte order differences (big-endian vs. little-endian) between different processor architectures.
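The size and byte-order differences between the Unicode encodings can be observed directly with Python's codecs. The sketch below is illustrative only; `encoded_lengths` is a hypothetical helper name.

```python
def encoded_lengths(ch: str) -> dict:
    """Number of bytes one character needs in three Unicode encodings."""
    return {name: len(ch.encode(name)) for name in ("utf-8", "utf-16-be", "utf-32-be")}


latin = encoded_lengths("A")             # basic Latin letter
euro = encoded_lengths("\u20ac")         # euro sign, inside the first 64k code points
emoji = encoded_lengths("\U0001F600")    # beyond the first 64k code points

# Byte order: the same 16-bit code unit serialised big-endian and little-endian.
big_endian = "A".encode("utf-16-be")
little_endian = "A".encode("utf-16-le")

print(latin, euro, emoji)
print(big_endian.hex(), little_endian.hex())
```

A receiver that assumes the wrong byte order would read 0x4100 instead of 0x0041, which is why the 16-bit and 32-bit encodings need either an agreed byte order or a byte order mark.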

UTF-8

• Encoding as a variable-length sequence of bytes
• Simple recognition of multibyte characters
• Compact storage of text in Latin characters
• Only the shortest encoding of a code point is allowed
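These properties can be made concrete with a small Python sketch. Continuation bytes of a multibyte sequence always have the bit pattern 10xxxxxx, so characters can be counted without decoding, and a strict decoder rejects an overlong (non-shortest) sequence. The function names are illustrative.

```python
def count_characters(data: bytes) -> int:
    """Count characters in UTF-8 data by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "Gr\u00fc\u00dfe \u20ac"   # 7 characters, some of them multibyte
data = text.encode("utf-8")

overlong_slash = b"\xc0\xaf"      # two-byte form of "/", not the shortest encoding

print(len(text), len(data), count_characters(data))
print(is_valid_utf8(data), is_valid_utf8(overlong_slash))
```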


Usage of a code page

• Code pages can be specified at different levels:
• At the operating system where the application runs
• At the operating system where the server runs
• At the operating system where the application is prepared/bound
• At the database level

Default code page

• By default, the DB2 server and clients use the local settings of the operating system or user:
• Windows: The server process uses the default regional settings of the operating system.
• Linux/Unix: The code page is derived from the locale setting of the instance user (i.e. the user running the database processes).
• Client (LUW): The current locale settings of the user determine the code page used during CONNECT.
• Programming language: Java always uses Unicode when connecting to a database via JDBC.

Specifying a code page: OS level

• Windows: Control Panel → Regional and Language settings, chcp command
• Linux/Unix: locale command
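To see which character encoding a process derives from the operating system settings, a quick check from Python can help. This is only a sketch of how an ordinary program picks up the locale; it does not show how DB2 itself determines its code page, and the function name `current_codeset` is made up for this example.

```python
import locale
import os


def current_codeset() -> str:
    """Return the encoding name that Python derives from the current OS settings."""
    return locale.getpreferredencoding(False)


# On Linux/Unix the locale is usually controlled by these environment variables.
for name in ("LC_ALL", "LC_CTYPE", "LANG"):
    print(name, "=", os.environ.get(name, "(not set)"))

print("derived encoding:", current_codeset())
```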

At prepare/bind time

• Special case during development of database software with static, embedded SQL.
• Embedded SQL needs a prepare phase before compilation of the source code.
• Later the prepared package needs to be bound to the database with the bind command.
• Both commands need a database connection, and the locale setting in effect at connect time is used.

Defining a database w/ code page

• Explicitly set the code page at creation time:
CREATE DB test USING CODESET codeset TERRITORY territory COLLATE USING collatingseq
For example: CREATE DB test USING CODESET UTF-8 TERRITORY US COLLATE USING SYSTEM
• Otherwise the current locale is used to determine the database codeset.
• The chosen code page cannot be changed later.
• In DB2 for iSeries and for z/OS you can also define single columns of a table in a different code set (not detailed here).


Code page conversion

• If application and server use different code pages, code page conversion happens.
• Code page conversion is always done on the receiver's side:
• on the server's side for data sent from client to server
• on the client's side for data sent from server to client
• Exception: Importing IXF files generated on a different system with another code page
• If conversion tables are missing: SQLCODE -332

Client to server conversion

Client uses code page X; server uses code page Y.

• Client: Send data using code page X
• Server: Receive data, convert to code page Y, process data
• Server: Return result in code page Y
• Client: Receive data in Y, convert to code page X
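The receiver-side conversion described above can be mimicked in a few lines of Python. This is a simplified model under stated assumptions: the client is assumed to use code page 1252 and the server UTF-8, and the function `receive` is a hypothetical stand-in for the conversion the receiving side performs; it is not a DB2 API.

```python
CLIENT_CP = "cp1252"   # assumed client code page X
SERVER_CP = "utf-8"    # assumed server code page Y


def receive(data: bytes, sender_cp: str, receiver_cp: str) -> bytes:
    """Model of receiver-side conversion: interpret in sender's code page, re-encode in own."""
    return data.decode(sender_cp).encode(receiver_cp)


# Client sends data in its own code page; the server converts on receipt.
request = "M\u00fcller".encode(CLIENT_CP)
stored = receive(request, CLIENT_CP, SERVER_CP)

# Server returns the result in its own code page; the client converts on receipt.
reply = receive(stored, SERVER_CP, CLIENT_CP)

print(request.hex(), stored.hex(), reply.hex())
```

In this model the sender never converts anything: each side only needs the conversion table from the partner's code page into its own.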

Using DB2 Connect

Client uses code page X; gateway uses code page Y; server uses code page Z.

• Client: Send data using code page X
• Gateway: Receive data, convert to code page Y, send data in Y
• Server: Receive data, convert to code page Z
• Server: Return result in code page Z
• Gateway: Receive data in Z, convert to Y, return result in Y
• Client: Receive data in Y, convert to code page X

Other considerations

• Mapping of characters (injective):
If a character in the source code page is not contained in the target code page, it is replaced by a substitution character.
• Round trip conversion (bijective):
If no substitution needs to take place between source and target code pages, a round trip conversion does not lose information.
• Encoding/decoding can change the number of bytes needed to store the data.
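Substitution, round trips and changing byte counts can all be demonstrated with Python's codecs. Note the assumption: Python substitutes "?" when told to replace unmappable characters, whereas the substitution character used by DB2 conversion tables may differ. The helper `round_trip` is hypothetical.

```python
def round_trip(text: str, target_cp: str) -> str:
    """Convert text to the target code page with substitution, then back again."""
    return text.encode(target_cp, errors="replace").decode(target_cp)


# The euro sign does not exist in ISO-8859-1, so it is substituted and lost.
lossy = round_trip("5 \u20ac", "iso-8859-1")

# All characters of this text exist in ISO-8859-1, so nothing is lost.
lossless = round_trip("Gr\u00f6\u00dfe", "iso-8859-1")

# The same text needs a different number of bytes in different code pages.
bytes_latin1 = len("Gr\u00f6\u00dfe".encode("iso-8859-1"))
bytes_utf8 = len("Gr\u00f6\u00dfe".encode("utf-8"))

print(lossy == "5 \u20ac", lossless == "Gr\u00f6\u00dfe", bytes_latin1, bytes_utf8)
```

The byte counts matter for column definitions: a value that fits a character column in a single-byte database may need more bytes once it is stored in a UTF-8 database.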

More considerations

• Using different conversion tables and the € symbol:
The Microsoft ANSI code page and the official code page 850 have a different code point for the Euro symbol. If needed, code page conversion tables can be replaced (see Administration Guide: Planning).
• Unicode support:
DB2 supports the UCS-2 character set with UTF-8 and UCS-2 encoding for Unicode databases.
• For pureXML (V9.x) a UTF-8 database is needed.

More considerations

• To change the code page of a database, you have to use db2move (Export/Import). Backup/Restore cannot be used, so choosing the right database code page during database creation is crucial.
• Binary data (BLOB, FOR BIT DATA) is internally stored with code page 0, so no character conversion is applied.


Troubleshooting

• Identify the code pages in use:
• db2 get db cfg for sample
Retrieves the database code page
• Display the SQLCA during CONNECT with the CLP
When connecting to a database via the CLP, the option "-a" displays the SQLCA data area, which shows the code page of the database and of the connecting client.
• If connecting to iSeries or zSeries machines from DB2 LUW, check whether conversion tables are available.
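When checking many databases, the relevant lines of the configuration output can be picked out with a small script. The following Python sketch parses a text excerpt; the excerpt in `SAMPLE_OUTPUT` is an assumed, illustrative fragment of what `db2 get db cfg` prints, and the exact layout may differ between DB2 versions. The function `parse_db_cfg` is hypothetical.

```python
import re

# Assumed excerpt of "db2 get db cfg" output, for illustration only.
SAMPLE_OUTPUT = """\
 Database territory                                      = US
 Database code page                                      = 1208
 Database code set                                       = UTF-8
"""


def parse_db_cfg(text: str) -> dict:
    """Split 'name = value' lines of configuration output into a dictionary."""
    settings = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?)\s*=\s*(.*)$", line)
        if match:
            settings[match.group(1)] = match.group(2).strip()
    return settings


cfg = parse_db_cfg(SAMPLE_OUTPUT)
print(cfg["Database code page"], cfg["Database code set"])
```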

Pitfalls

• Watch out for unintentional "conversions"
• All database communication partners are configured correctly, but the DBA is looking at the data in a console window, and the console window (or PuTTY) is using the wrong code page to display the data!
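The effect of such a display-only "conversion" is easy to reproduce. In this Python sketch the data is stored correctly as UTF-8, but a terminal assumed to be set to code page 1252 interprets the bytes differently; the stored bytes themselves are never changed.

```python
# Correctly stored UTF-8 data.
stored = "M\u00fcller".encode("utf-8")

# What a console set to code page 1252 would display for the same bytes.
shown_wrong = stored.decode("cp1252")

# What a console set to UTF-8 would display.
shown_right = stored.decode("utf-8")

print(ascii(shown_wrong), ascii(shown_right))
```

Because the data in the database is fine, "repairing" it based on what the console shows would introduce a real error.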

db2set DB2CODEPAGE

• Know what you intend to do if you use the DB2 environment variable DB2CODEPAGE.
• It tells DB2 that you will feed it the right code points, regardless of the displayed symbols.
• See Technote "Setting DB2CODEPAGE=1208 may result in incorrect character data insertion"
SQL0191N Error occurred because of a fragmented MBCS character.
http://www.ibm.com/support/docview.wss?uid=swg21601028
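Why a wrongly declared code page leads to broken multibyte characters can be sketched in Python. The assumption here is a client whose terminal really produces code page 1252 bytes while the data is declared as UTF-8 (code page 1208). The check below is only an analogy to the situation behind SQL0191N, not a reproduction of DB2's own validation; `is_well_formed_utf8` is a hypothetical helper.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if the bytes form complete, valid UTF-8 sequences."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Bytes as actually produced by a terminal working in code page 1252.
actual_bytes = "M\u00fcller".encode("cp1252")

# Declared as UTF-8, the byte 0xFC starts a multibyte sequence that never completes.
accepted_as_utf8 = is_well_formed_utf8(actual_bytes)

# The same text, really encoded in UTF-8, passes the check.
accepted_when_correct = is_well_formed_utf8("M\u00fcller".encode("utf-8"))

print(accepted_as_utf8, accepted_when_correct)
```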

db2set DB2CONSOLECP

• Intended to allow DB2 CLI to use different code pages for output.
• Multiple APARs for DB2 9.1, 9.5, 9.7:
"DB2CONSOLECP environment variable has no effect on DB2 message text or is not working"

DB2 Special Registers for NLS

• Change the language of the message text for DB2 MONREPORT modules:
db2 "SET CURRENT LOCALE LC_MESSAGES = 'de_DE'"
db2 "call monreport.lockwait"
• Change the language of month and day names for dates and times:
db2 "SET CURRENT LOCALE LC_TIME = 'fr_FR'"
db2 "values monthname(current date)"
(Works with DAYNAME, MONTHNAME, NEXT_DAY, ROUND, ROUND_TIMESTAMP, TIMESTAMP_FORMAT, TRUNCATE, TRUNC_TIMESTAMP and VARCHAR_FORMAT)

Performance considerations

• Try to avoid unnecessary conversions.
• Create databases from the start with the code page needed for your applications.
• For international databases prefer UTF-8, especially when used with Java programs.
• Remember: Conversion takes time.
