Oracle 5.0 Reference Manual page 2921

MySQL 5.0 FAQ: MySQL Chinese, Japanese, and Korean Character Sets
www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt. This is not the first table you will find by navigating
from the unicode.org home page, because MySQL uses the older 4.0.0 "allkeys" table rather than
the more recent 4.1.0 table. (The newer '520' collations in MySQL 5.6 use the 5.2 "allkeys" table.)
This is because we are very wary of changing an ordering that affects indexes, lest we bring about
situations such as that reported in Bug #16526, illustrated as follows:
mysql> CREATE TABLE tj (s1 CHAR(1) CHARACTER SET utf8 COLLATE utf8_unicode_ci);
Query OK, 0 rows affected (0.05 sec)

mysql> INSERT INTO tj VALUES ('が'),('か');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM tj WHERE s1 = 'か';
+------+
| s1   |
+------+
| が   |
| か   |
+------+
2 rows in set (0.00 sec)
The character in the first result row is not the one that we searched for. Why did MySQL retrieve it?
First we look for the Unicode code point value, which is possible by reading the hexadecimal number
for the ucs2 version of the characters:
mysql> SELECT s1, HEX(CONVERT(s1 USING ucs2)) FROM tj;
+------+-----------------------------+
| s1   | HEX(CONVERT(s1 USING ucs2)) |
+------+-----------------------------+
| が   | 304C                        |
| か   | 304B                        |
+------+-----------------------------+
2 rows in set (0.03 sec)
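The same code points can be checked outside MySQL. The following is a minimal Python sketch, not part of the MySQL manual; it uses only the standard library, and the helper name ucs2_hex is hypothetical. For characters in the Basic Multilingual Plane it should mirror what HEX(CONVERT(s1 USING ucs2)) reports:

```python
# Illustrative helper (hypothetical name); not a MySQL API.
GA = "\u304c"  # HIRAGANA LETTER GA
KA = "\u304b"  # HIRAGANA LETTER KA


def ucs2_hex(ch):
    """Return the big-endian UTF-16 bytes of ch as uppercase hex.

    For BMP characters this is the two-byte UCS-2 value, i.e. the
    Unicode code point written as four hexadecimal digits.
    """
    return ch.encode("utf-16-be").hex().upper()


for ch in (GA, KA):
    print(ch, ucs2_hex(ch))
```

The two values differ by one, so any comparison that treats the letters as equal must be working from something other than the raw code points.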
Now we search for 304B and 304C in the 4.0.0 allkeys table, and find these lines:
304B ; [.1E57.0020.000E.304B] # HIRAGANA LETTER KA
304C ; [.1E57.0020.000E.304B][.0000.0140.0002.3099] # HIRAGANA LETTER GA; QQCM
The official Unicode names (following the "#" mark) tell us the Japanese syllabary (Hiragana),
the informal classification (letter, digit, or punctuation mark), and the Western identifier
(KA or GA, which happen to be the unvoiced and voiced components of the same letter pair). More
importantly, the primary weight (the first hexadecimal number inside the square brackets) is
1E57 on both lines. For comparisons in both searching and sorting, MySQL pays attention to the
primary weight only, ignoring all the other numbers. This means that we are sorting が and か
correctly according to the Unicode specification. If we wanted to distinguish them, we'd have to
use a non-UCA (Unicode Collation Algorithm) collation (utf8_bin or utf8_general_ci), or compare
the HEX() values, or use ORDER BY CONVERT(s1 USING sjis). Being correct "according to Unicode"
isn't enough, of course: the person who submitted the bug was equally correct. We plan to add
another collation for Japanese according to the JIS X 4061 standard, in which voiced/unvoiced
letter pairs like KA/GA are distinguishable for ordering purposes.
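To make the primary-weight argument concrete, here is a small Python sketch. It is an illustration only, not MySQL's implementation: it parses the two allkeys lines quoted above, keeps only the first weight of each collation element, and assumes (as the Unicode Collation Algorithm does when building sort keys) that a weight of 0000 is ignored. The names ALLKEYS_LINES, parse_line, and primary_key are hypothetical.

```python
import re

# The two lines quoted from the 4.0.0 allkeys table.
ALLKEYS_LINES = [
    "304B ; [.1E57.0020.000E.304B] # HIRAGANA LETTER KA",
    "304C ; [.1E57.0020.000E.304B][.0000.0140.0002.3099] # HIRAGANA LETTER GA; QQCM",
]


def parse_line(line):
    """Split an allkeys line into (code point, list of collation elements)."""
    code_point, rest = line.split(";", 1)
    elements_part = rest.split("#", 1)[0]  # drop the trailing name comment
    # Each element looks like [.pppp.ssss.tttt.cccc]; keep its four hex fields.
    elements = [
        tuple(fields.split("."))
        for fields in re.findall(r"\[[.*]([0-9A-F.]+)\]", elements_part)
    ]
    return code_point.strip(), elements


def primary_key(elements):
    """Keep only the nonzero primary weights (first field of each element)."""
    return [e[0] for e in elements if e[0] != "0000"]


weights = dict(parse_line(line) for line in ALLKEYS_LINES)
ka, ga = weights["304B"], weights["304C"]

print(primary_key(ka), primary_key(ga))
print("equal on primary weight:", primary_key(ka) == primary_key(ga))
print("equal on all weights:", ka == ga)
```

Under these assumptions the two letters compare equal on primary weight alone, while the full lists of collation elements differ, which is why a comparison that looks past the primary weight (or at the raw code points) can tell them apart.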
B.11.15: Why do CJK strings sort incorrectly in Unicode? (II)
If you are using Unicode (ucs2 or utf8), and you know what the Unicode sort order is (see
Section B.11, "MySQL 5.0 FAQ: MySQL Chinese, Japanese, and Korean Character Sets"), but MySQL
still seems to sort your table incorrectly, then you should first verify the table character set:
mysql> SHOW CREATE TABLE t\G
******************** 1. row ******************
       Table: t
Create Table: CREATE TABLE `t` (
  `s1` char(1) CHARACTER SET ucs2 DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
