Содержание
- 9.1. Character Set Support
- 9.1.1. Character Sets and Collations in General
- 9.1.2. Character Sets and Collations in MySQL
- 9.1.3. Specifying Character Sets and Collations
- 9.1.4. Connection Character Sets and Collations
- 9.1.5. Configuring the Character Set and Collation for Applications
- 9.1.6. Character Set for Error Messages
- 9.1.7. Collation Issues
- 9.1.8. String Repertoire
- 9.1.9. Operations Affected by Character Set Support
- 9.1.10. Unicode Support
- 9.1.11. Upgrading from Previous to Current Unicode Support
- 9.1.12. UTF-8 for Metadata
- 9.1.13. Column Character Set Conversion
- 9.1.14. Character Sets and Collations That MySQL Supports
- 9.2. Setting the Error Message Language
- 9.3. Adding a Character Set
- 9.4. Adding a Collation to a Character Set
- 9.5. Character Set Configuration
- 9.6. MySQL Server Time Zone Support
- 9.7. MySQL Server Locale Support
This chapter covers issues of globalization, which includes internationalization (MySQL's capabilities for adapting to local use) and localization (selecting particular local conventions):
MySQL support for character sets in SQL statements.
How to configure the server to support different character sets.
Selecting the language for error messages.
How to set the server's time zone and enable per-connection time zone support.
Selecting the locale for day and month names.
- 9.1.1. Character Sets and Collations in General
- 9.1.2. Character Sets and Collations in MySQL
- 9.1.3. Specifying Character Sets and Collations
- 9.1.4. Connection Character Sets and Collations
- 9.1.5. Configuring the Character Set and Collation for Applications
- 9.1.6. Character Set for Error Messages
- 9.1.7. Collation Issues
- 9.1.8. String Repertoire
- 9.1.9. Operations Affected by Character Set Support
- 9.1.10. Unicode Support
- 9.1.11. Upgrading from Previous to Current Unicode Support
- 9.1.12. UTF-8 for Metadata
- 9.1.13. Column Character Set Conversion
- 9.1.14. Character Sets and Collations That MySQL Supports
MySQL includes character set support that enables you to store
data using a variety of character sets and perform comparisons
according to a variety of collations. You can specify character
sets at the server, database, table, and column level. MySQL
supports the use of character sets for the
MyISAM
, MEMORY
,
NDBCLUSTER
, and
InnoDB
storage engines.
This chapter discusses the following topics:
What are character sets and collations?
The multiple-level default system for character set assignment.
Синтаксис for specifying character sets and collations.
Affected functions and operations.
Unicode support.
The character sets and collations that are available, with notes.
Character set issues affect not only data storage, but also
communication between client programs and the MySQL server. If you
want the client program to communicate with the server using a
character set different from the default, you'll need to indicate
which one. For example, to use the utf8
Unicode
character set, issue this statement after connecting to the
server:
SET NAMES 'utf8';
For more information about configuring character sets for application use and character set-related issues in client/server communication, see Section 9.1.5, “Configuring the Character Set and Collation for Applications”, and Section 9.1.4, “Connection Character Sets and Collations”.
A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.
Suppose that we have an alphabet with four letters:
“A
”,
“B
”,
“a
”,
“b
”. We give each letter a
number: “A
” = 0,
“B
” = 1,
“a
” = 2,
“b
” = 3. The letter
“A
” is a symbol, the number 0 is
the encoding for
“A
”, and the combination of all
four letters and their encodings is a
character set.
Suppose that we want to compare two string values,
“A
” and
“B
”. The simplest way to do this
is to look at the encodings: 0 for
“A
” and 1 for
“B
”. Because 0 is less than 1,
we say “A
” is less than
“B
”. What we've just done is
apply a collation to our character set. The collation is a set
of rules (only one rule in this case): “compare the
encodings.” We call this simplest of all possible
collations a binary collation.
But what if we want to say that the lowercase and uppercase
letters are equivalent? Then we would have at least two rules:
(1) treat the lowercase letters
“a
” and
“b
” as equivalent to
“A
” and
“B
”; (2) then compare the
encodings. We call this a
case-insensitive collation. It is a
little more complex than a binary collation.
In real life, most character sets have many characters: not just
“A
” and
“B
” but whole alphabets,
sometimes multiple alphabets or eastern writing systems with
thousands of characters, along with many special symbols and
punctuation marks. Also in real life, most collations have many
rules, not just for whether to distinguish lettercase, but also
for whether to distinguish accents (an “accent” is
a mark attached to a character as in German
“Ö
”), and for
multiple-character mappings (such as the rule that
“Ö
” =
“OE
” in one of the two German
collations).
MySQL can do these things for you:
Store strings using a variety of character sets.
Compare strings using a variety of collations.
Mix strings with different character sets or collations in the same server, the same database, or even the same table.
Enable specification of character set and collation at any level.
In these respects, MySQL is far ahead of most other database management systems. However, to use these features effectively, you need to know what character sets and collations are available, how to change the defaults, and how they affect the behavior of string operators and functions.
The MySQL server can support multiple character sets. To list
the available character sets, use the SHOW
CHARACTER SET
statement. A partial listing follows.
For more complete information, see
Section 9.1.14, “Character Sets and Collations That MySQL Supports”.
mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset | Описание | Default collation | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 |
| dec8 | DEC West European | dec8_swedish_ci | 1 |
| cp850 | DOS West European | cp850_general_ci | 1 |
| hp8 | HP West European | hp8_english_ci | 1 |
| koi8r | KOI8-R Relcom Russian | koi8r_general_ci | 1 |
| latin1 | cp1252 West European | latin1_swedish_ci | 1 |
| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |
| swe7 | 7bit Swedish | swe7_swedish_ci | 1 |
| ascii | US ASCII | ascii_general_ci | 1 |
| ujis | EUC-JP Japanese | ujis_japanese_ci | 3 |
| sjis | Shift-JIS Japanese | sjis_japanese_ci | 2 |
| hebrew | ISO 8859-8 Hebrew | hebrew_general_ci | 1 |
| tis620 | TIS620 Thai | tis620_thai_ci | 1 |
| euckr | EUC-KR Korean | euckr_korean_ci | 2 |
| koi8u | KOI8-U Ukrainian | koi8u_general_ci | 1 |
| gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci | 2 |
| greek | ISO 8859-7 Greek | greek_general_ci | 1 |
| cp1250 | Windows Central European | cp1250_general_ci | 1 |
| gbk | GBK Simplified Chinese | gbk_chinese_ci | 2 |
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 |
...
Any given character set always has at least one collation. It
may have several collations. To list the collations for a
character set, use the SHOW
COLLATION
statement. For example, to see the
collations for the latin1
(cp1252 West
European) character set, use this statement to find those
collation names that begin with latin1
:
mysql> SHOW COLLATION LIKE 'latin1%';
+---------------------+---------+----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+---------------------+---------+----+---------+----------+---------+
| latin1_german1_ci | latin1 | 5 | | | 0 |
| latin1_swedish_ci | latin1 | 8 | Yes | Yes | 1 |
| latin1_danish_ci | latin1 | 15 | | | 0 |
| latin1_german2_ci | latin1 | 31 | | Yes | 2 |
| latin1_bin | latin1 | 47 | | Yes | 1 |
| latin1_general_ci | latin1 | 48 | | | 0 |
| latin1_general_cs | latin1 | 49 | | | 0 |
| latin1_spanish_ci | latin1 | 94 | | | 0 |
+---------------------+---------+----+---------+----------+---------+
The latin1
collations have the following
meanings.
Collation | Meaning |
---|---|
latin1_german1_ci | German DIN-1 |
latin1_swedish_ci | Swedish/Finnish |
latin1_danish_ci | Danish/Norwegian |
latin1_german2_ci | German DIN-2 |
latin1_bin | Binary according to latin1 encoding |
latin1_general_ci | Multilingual (Western European) |
latin1_general_cs | Multilingual (ISO Western European), case sensitive |
latin1_spanish_ci | Modern Spanish |
Collations have these general characteristics:
Two different character sets cannot have the same collation.
Each character set has one collation that is the default collation. For example, the default collation for
latin1
islatin1_swedish_ci
. The output forSHOW CHARACTER SET
indicates which collation is the default for each displayed character set.There is a convention for collation names: They start with the name of the character set with which they are associated, they usually include a language name, and they end with
_ci
(case insensitive),_cs
(case sensitive), or_bin
(binary).
In cases where a character set has multiple collations, it might not be clear which collation is most suitable for a given application. To avoid choosing the wrong collation, it can be helpful to perform some comparisons with representative data values to make sure that a given collation sorts values the way you expect.
Collation-Charts.Org is a useful site for information that shows how one collation compares to another.
- 9.1.3.1. Server Character Set and Collation
- 9.1.3.2. Database Character Set and Collation
- 9.1.3.3. Table Character Set and Collation
- 9.1.3.4. Column Character Set and Collation
- 9.1.3.5. Character String Literal Character Set and Collation
- 9.1.3.6. National Character Set
- 9.1.3.7. Examples of Character Set and Collation Assignment
- 9.1.3.8. Compatibility with Other DBMSs
There are default settings for character sets and collations at four levels: server, database, table, and column. The description in the following sections may appear complex, but it has been found in practice that multiple-level defaulting leads to natural and obvious results.
CHARACTER SET
is used in clauses that specify
a character set. CHARSET
can be used as a
synonym for CHARACTER SET
.
Character set issues affect not only data storage, but also
communication between client programs and the MySQL server. If
you want the client program to communicate with the server using
a character set different from the default, you'll need to
indicate which one. For example, to use the
utf8
Unicode character set, issue this
statement after connecting to the server:
SET NAMES 'utf8';
For more information about character set-related issues in client/server communication, see Section 9.1.4, “Connection Character Sets and Collations”.
MySQL Server has a server character set and a server collation. These can be set at server startup on the command line or in an option file and changed at runtime.
Initially, the server character set and collation depend on
the options that you use when you start
mysqld. You can use
--character-set-server
for the
character set. Along with it, you can add
--collation-server
for the
collation. If you don't specify a character set, that is the
same as saying
--character-set-server=latin1
.
If you specify only a character set (for example,
latin1
) but not a collation, that is the
same as saying
--character-set-server=latin1
--collation-server=latin1_swedish_ci
because latin1_swedish_ci
is the default
collation for latin1
. Therefore, the
following three commands all have the same effect:
shell>mysqld
shell>mysqld --character-set-server=latin1
shell>mysqld --character-set-server=latin1 \
--collation-server=latin1_swedish_ci
One way to change the settings is by recompiling. To change
the default server character set and collation when building
from sources, use the
DEFAULT_CHARSET
and
DEFAULT_COLLATION
options for
CMake. For example:
shell> cmake . -DDEFAULT_CHARSET=latin1
Or:
shell>cmake . -DDEFAULT_CHARSET=latin1 \
-DDEFAULT_COLLATION=latin1_german1_ci
Both mysqld and CMake verify that the character set/collation combination is valid. If not, each program displays an error message and terminates.
The server character set and collation are used as default
values if the database character set and collation are not
specified in CREATE DATABASE
statements. They have no other purpose.
The current server character set and collation can be
determined from the values of the
character_set_server
and
collation_server
system
variables. These variables can be changed at runtime.
Every database has a database character set and a database
collation. The CREATE DATABASE
and ALTER DATABASE
statements
have optional clauses for specifying the database character
set and collation:
CREATE DATABASEdb_name
[[DEFAULT] CHARACTER SETcharset_name
] [[DEFAULT] COLLATEcollation_name
] ALTER DATABASEdb_name
[[DEFAULT] CHARACTER SETcharset_name
] [[DEFAULT] COLLATEcollation_name
]
The keyword SCHEMA
can be used instead of
DATABASE
.
All database options are stored in a text file named
db.opt
that can be found in the database
directory.
The CHARACTER SET
and
COLLATE
clauses make it possible to create
databases with different character sets and collations on the
same MySQL server.
Пример:
CREATE DATABASE db_name
CHARACTER SET latin1 COLLATE latin1_swedish_ci;
MySQL chooses the database character set and database collation in the following manner:
If both
CHARACTER SET
andX
COLLATE
are specified, character setY
X
and collationY
are used.If
CHARACTER SET
is specified withoutX
COLLATE
, character setX
and its default collation are used. To see the default collation for each character set, use theSHOW COLLATION
statement.If
COLLATE
is specified withoutY
CHARACTER SET
, the character set associated withY
and collationY
are used.Otherwise, the server character set and server collation are used.
The database character set and collation are used as default
values for table definitions if the table character set and
collation are not specified in CREATE
TABLE
statements. The database character set also is
used by LOAD DATA
INFILE
. The character set and collation have no
other purposes.
The character set and collation for the default database can
be determined from the values of the
character_set_database
and
collation_database
system
variables. The server sets these variables whenever the
default database changes. If there is no default database, the
variables have the same value as the corresponding
server-level system variables,
character_set_server
and
collation_server
.
Every table has a table character set and a table collation.
The CREATE TABLE
and
ALTER TABLE
statements have
optional clauses for specifying the table character set and
collation:
CREATE TABLEtbl_name
(column_list
) [[DEFAULT] CHARACTER SETcharset_name
] [COLLATEcollation_name
]] ALTER TABLEtbl_name
[[DEFAULT] CHARACTER SETcharset_name
] [COLLATEcollation_name
]
Пример:
CREATE TABLE t1 ( ... ) CHARACTER SET latin1 COLLATE latin1_danish_ci;
MySQL chooses the table character set and collation in the following manner:
If both
CHARACTER SET
andX
COLLATE
are specified, character setY
X
and collationY
are used.If
CHARACTER SET
is specified withoutX
COLLATE
, character setX
and its default collation are used. To see the default collation for each character set, use theSHOW COLLATION
statement.If
COLLATE
is specified withoutY
CHARACTER SET
, the character set associated withY
and collationY
are used.Otherwise, the database character set and collation are used.
The table character set and collation are used as default values for column definitions if the column character set and collation are not specified in individual column definitions. The table character set and collation are MySQL extensions; there are no such things in standard SQL.
Every “character” column (that is, a column of
type CHAR
,
VARCHAR
, or
TEXT
) has a column character
set and a column collation. Column definition syntax for
CREATE TABLE
and
ALTER TABLE
has optional
clauses for specifying the column character set and collation:
col_name
{CHAR | VARCHAR | TEXT} (col_length
) [CHARACTER SETcharset_name
] [COLLATEcollation_name
]
These clauses can also be used for
ENUM
and
SET
columns:
col_name
{ENUM | SET} (val_list
) [CHARACTER SETcharset_name
] [COLLATEcollation_name
]
Examples:
CREATE TABLE t1 ( col1 VARCHAR(5) CHARACTER SET latin1 COLLATE latin1_german1_ci ); ALTER TABLE t1 MODIFY col1 VARCHAR(5) CHARACTER SET latin1 COLLATE latin1_swedish_ci;
MySQL chooses the column character set and collation in the following manner:
If both
CHARACTER SET
andX
COLLATE
are specified, character setY
X
and collationY
are used.CREATE TABLE t1 ( col1 CHAR(10) CHARACTER SET utf8 COLLATE utf8_unicode_ci ) CHARACTER SET latin1 COLLATE latin1_bin;
The character set and collation are specified for the column, so they are used. The column has character set
utf8
and collationutf8_unicode_ci
.If
CHARACTER SET
is specified withoutX
COLLATE
, character setX
and its default collation are used.CREATE TABLE t1 ( col1 CHAR(10) CHARACTER SET utf8 ) CHARACTER SET latin1 COLLATE latin1_bin;
The character set is specified for the column, but the collation is not. The column has character set
utf8
and the default collation forutf8
, which isutf8_general_ci
. To see the default collation for each character set, use theSHOW COLLATION
statement.If
COLLATE
is specified withoutY
CHARACTER SET
, the character set associated withY
and collationY
are used.CREATE TABLE t1 ( col1 CHAR(10) COLLATE utf8_polish_ci ) CHARACTER SET latin1 COLLATE latin1_bin;
The collation is specified for the column, but the character set is not. The column has collation
utf8_polish_ci
and the character set is the one associated with the collation, which isutf8
.Otherwise, the table character set and collation are used.
CREATE TABLE t1 ( col1 CHAR(10) ) CHARACTER SET latin1 COLLATE latin1_bin;
Neither the character set nor collation are specified for the column, so the table defaults are used. The column has character set
latin1
and collationlatin1_bin
.
The CHARACTER SET
and
COLLATE
clauses are standard SQL.
If you use ALTER TABLE
to
convert a column from one character set to another, MySQL
attempts to map the data values, but if the character sets are
incompatible, there may be data loss.
Every character string literal has a character set and a collation.
A character string literal may have an optional character set
introducer and COLLATE
clause:
[_charset_name
]'string
' [COLLATEcollation_name
]
Examples:
SELECT 'string
'; SELECT _latin1'string
'; SELECT _latin1'string
' COLLATE latin1_danish_ci;
For the simple statement SELECT
'
, the string has
the character set and collation defined by the
string
'character_set_connection
and
collation_connection
system
variables.
The
_
expression is formally called an
introducer. It tells the parser,
“the string that is about to follow uses character set
charset_name
X
.” Because this has
confused people in the past, we emphasize that an introducer
does not change the string to the introducer character set
like CONVERT()
would do. It
does not change the string's value, although padding may
occur. The introducer is just a signal. An introducer is also
legal before standard hex literal and numeric hex literal
notation
(x'
and
literal
'0x
), or
before bit-field literal notation
(nnnn
b'
and
literal
'0b
).
nnnn
Examples:
SELECT _latin1 x'AABBCC'; SELECT _latin1 0xAABBCC; SELECT _latin1 b'1100011'; SELECT _latin1 0b1100011;
MySQL determines a literal's character set and collation in the following manner:
If both
_X
andCOLLATE
are specified, character setY
X
and collationY
are used.If
_X
is specified butCOLLATE
is not specified, character setX
and its default collation are used. To see the default collation for each character set, use theSHOW COLLATION
statement.Otherwise, the character set and collation given by the
character_set_connection
andcollation_connection
system variables are used.
Examples:
A string with
latin1
character set andlatin1_german1_ci
collation:SELECT _latin1'Müller' COLLATE latin1_german1_ci;
A string with
latin1
character set and its default collation (that is,latin1_swedish_ci
):SELECT _latin1'Müller';
A string with the connection default character set and collation:
SELECT 'Müller';
Character set introducers and the COLLATE
clause are implemented according to standard SQL
specifications.
An introducer indicates the character set for the following
string, but does not change now how the parser performs escape
processing within the string. Escapes are always interpreted
by the parser according to the character set given by
character_set_connection
.
The following examples show that escape processing occurs
using
character_set_connection
even
in the presence of an introducer. The examples use
SET NAMES
(which changes
character_set_connection
, as
discussed in Section 9.1.4, “Connection Character Sets and Collations”), and
display the resulting strings using the
HEX()
function so that the
exact string contents can be seen.
Пример 1:
mysql>SET NAMES latin1;
Query OK, 0 rows affected (0.01 sec) mysql>SELECT HEX('à\n'), HEX(_sjis'à\n');
+------------+-----------------+ | HEX('à\n') | HEX(_sjis'à\n') | +------------+-----------------+ | E00A | E00A | +------------+-----------------+ 1 row in set (0.00 sec)
Here, “à
” (hex value
E0
) is followed by
“\n
”, the escape sequence for
newline. The escape sequence is interpreted using the
character_set_connection
value of latin1
to produce a literal
newline (hex value 0A
). This happens even
for the second string. That is, the introducer of
_sjis
does not affect the parser's escape
processing.
Пример 2:
mysql>SET NAMES sjis;
Query OK, 0 rows affected (0.00 sec) mysql>SELECT HEX('à\n'), HEX(_latin1'à\n');
+------------+-------------------+ | HEX('à\n') | HEX(_latin1'à\n') | +------------+-------------------+ | E05C6E | E05C6E | +------------+-------------------+ 1 row in set (0.04 sec)
Here,
character_set_connection
is
sjis
, a character set in which the sequence
of “à
” followed by
“\
” (hex values
05
and 5C
) is a valid
multi-byte character. Hence, the first two bytes of the string
are interpreted as a single sjis
character,
and the “\
” is not interpreted
as an escape character. The following
“n
” (hex value
6E
) is not interpreted as part of an escape
sequence. This is true even for the second string; the
introducer of _latin1
does not affect
escape processing.
Standard SQL defines NCHAR
or
NATIONAL CHAR
as a way to
indicate that a CHAR
column
should use some predefined character set. MySQL
5.5 uses utf8
as this
predefined character set. For example, these data type
declarations are equivalent:
CHAR(10) CHARACTER SET utf8 NATIONAL CHARACTER(10) NCHAR(10)
As are these:
VARCHAR(10) CHARACTER SET utf8 NATIONAL VARCHAR(10) NCHAR VARCHAR(10) NATIONAL CHARACTER VARYING(10) NATIONAL CHAR VARYING(10)
You can use
N'
(or
literal
'n'
) to
create a string in the national character set. These
statements are equivalent:
literal
'
SELECT N'some text'; SELECT n'some text'; SELECT _utf8'some text';
For information on upgrading character sets to MySQL 5.5 from versions prior to 4.1, see the MySQL 3.23, 4.0, 4.1 Справочное Руководство.
The following examples show how MySQL determines default character set and collation values.
Пример 1: Table and Column Definition
CREATE TABLE t1 ( c1 CHAR(10) CHARACTER SET latin1 COLLATE latin1_german1_ci ) DEFAULT CHARACTER SET latin2 COLLATE latin2_bin;
Here we have a column with a latin1
character set and a latin1_german1_ci
collation. The definition is explicit, so that is
straightforward. Notice that there is no problem with storing
a latin1
column in a
latin2
table.
Пример 2: Table and Column Definition
CREATE TABLE t1 ( c1 CHAR(10) CHARACTER SET latin1 ) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;
This time we have a column with a latin1
character set and a default collation. Although it might seem
natural, the default collation is not taken from the table
level. Instead, because the default collation for
latin1
is always
latin1_swedish_ci
, column
c1
has a collation of
latin1_swedish_ci
(not
latin1_danish_ci
).
Пример 3: Table and Column Definition
CREATE TABLE t1 ( c1 CHAR(10) ) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;
We have a column with a default character set and a default
collation. In this circumstance, MySQL checks the table level
to determine the column character set and collation.
Consequently, the character set for column
c1
is latin1
and its
collation is latin1_danish_ci
.
Пример 4: Database, Table, and Column Definition
CREATE DATABASE d1 DEFAULT CHARACTER SET latin2 COLLATE latin2_czech_ci; USE d1; CREATE TABLE t1 ( c1 CHAR(10) );
We create a column without specifying its character set and
collation. We're also not specifying a character set and a
collation at the table level. In this circumstance, MySQL
checks the database level to determine the table settings,
which thereafter become the column settings.) Consequently,
the character set for column c1
is
latin2
and its collation is
latin2_czech_ci
.
Several character set and collation system variables relate to a client's interaction with the server. Some of these have been mentioned in earlier sections:
The server character set and collation are the values of the
character_set_server
andcollation_server
system variables.The character set and collation of the default database are the values of the
character_set_database
andcollation_database
system variables.
Additional character set and collation system variables are involved in handling traffic for the connection between a client and the server. Every client has connection-related character set and collation system variables.
A “connection” is what you make when you connect to the server. The client sends SQL statements, such as queries, over the connection to the server. The server sends responses, such as result sets or error messages, over the connection back to the client. This leads to several questions about character set and collation handling for client connections, each of which can be answered in terms of system variables:
What character set is the statement in when it leaves the client?
The server takes the
character_set_client
system variable to be the character set in which statements are sent by the client.What character set should the server translate a statement to after receiving it?
For this, the server uses the
character_set_connection
andcollation_connection
system variables. It converts statements sent by the client fromcharacter_set_client
tocharacter_set_connection
(except for string literals that have an introducer such as_latin1
or_utf8
).collation_connection
is important for comparisons of literal strings. For comparisons of strings with column values,collation_connection
does not matter because columns have their own collation, which has a higher collation precedence.What character set should the server translate to before shipping result sets or error messages back to the client?
The
character_set_results
system variable indicates the character set in which the server returns query results to the client. This includes result data such as column values, and result metadata such as column names and error messages.
Clients can fine-tune the settings for these variables, or depend on the defaults (in which case, you can skip the rest of this section). If you do not use the defaults, you must change the character settings for each connection to the server.
Two statements affect the connection-related character set variables as a group:
SET NAMES '
charset_name
' [COLLATE 'collation_name
']SET NAMES
indicates what character set the client will use to send SQL statements to the server. Thus,SET NAMES 'cp1251'
tells the server, “future incoming messages from this client are in character setcp1251
.” It also specifies the character set that the server should use for sending results back to the client. (For example, it indicates what character set to use for column values if you use aSELECT
statement.)A
SET NAMES '
statement is equivalent to these three statements:x
'SET character_set_client =
x
; SET character_set_results =x
; SET character_set_connection =x
;Setting
character_set_connection
tox
also implicitly setscollation_connection
to the default collation forx
. It is unnecessary to set that collation explicitly. To specify a particular collation, use the optionalCOLLATE
clause:SET NAMES '
charset_name
' COLLATE 'collation_name
'SET CHARACTER SET
charset_name
SET CHARACTER SET
is similar toSET NAMES
but setscharacter_set_connection
andcollation_connection
tocharacter_set_database
andcollation_database
. ASET CHARACTER SET
statement is equivalent to these three statements:x
SET character_set_client =
x
; SET character_set_results =x
; SET collation_connection = @@collation_database;Setting
collation_connection
also implicitly setscharacter_set_connection
to the character set associated with the collation (equivalent to executingSET character_set_connection = @@character_set_database
). It is unnecessary to setcharacter_set_connection
explicitly.
ucs2
, utf16
, and
utf32
cannot be used as a client character
set, which means that they do not work for SET
NAMES
or SET CHARACTER SET
.
The MySQL client programs mysql
,
mysqladmin
, mysqlcheck
,
mysqlimport
, and mysqlshow
determine the default character set to use as follows:
In the absence of other information, the programs use the compiled-in default character set, usually
latin1
.The programs can autodetect which character set to use based on the operating system setting, such as the value of the
LANG
orLC_ALL
locale environment variable on Unix systems or the code page setting on Windows systems. For systems on which the locale is available from the OS, the client uses it to set the default character set rather than using the compiled-in default. For example, settingLANG
toru_RU.KOI8-R
causes thekoi8r
character set to be used. Thus, users can configure the locale in their environment for use by MySQL clients.The OS character set is mapped to the closest MySQL character set if there is no exact match. If the client does not support the matching character set, it uses the compiled-in default. For example,
ucs2
is not supported as a connection character set.C applications that wish to use character set autodetection based on the OS setting can invoke the following
mysql_options()
call before connecting to the server:mysql_options(mysql, MYSQL_SET_CHARSET_NAME, MYSQL_AUTODETECT_CHARSET_NAME);
The programs support a
--default-character-set
option, which enables users to specify the character set explicitly to override whatever default the client otherwise determines.
Before MySQL 5.5, in the absence of other
information, the MySQL client programs used the compiled-in
default character set, usually latin1
. An
implication of this difference is that if your environment is
configured to use a non-latin1
locale,
MySQL client programs will use a different connection
character set than previously, as though you had issued an
implicit SET
NAMES
statement. If the previous behavior is
required, start the client with the
--default-character-set=latin1
option.
When a client connects to the server, it sends the name of the
character set that it wants to use. The server uses the name to
set the character_set_client
,
character_set_results
, and
character_set_connection
system
variables. In effect, the server performs a SET
NAMES
operation using the character set name.
With the mysql client, if you want to use a
character set different from the default, you could explicitly
execute SET NAMES
every time you start up.
However, to accomplish the same result more easily, you can add
the --default-character-set
option
setting to your mysql command line or in your
option file. For example, the following option file setting
changes the three connection-related character set variables set
to koi8r
each time you invoke
mysql:
[mysql] default-character-set=koi8r
If you are using the mysql client with
auto-reconnect enabled (which is not recommended), it is
preferable to use the charset
command rather
than SET NAMES
. For example:
mysql> charset utf8
Charset changed
The charset
command issues a SET
NAMES
statement, and also changes the default
character set that mysql uses when it
reconnects after the connection has dropped.
Пример: Suppose that column1
is defined as
CHAR(5) CHARACTER SET latin2
. If you do not
say SET NAMES
or SET CHARACTER
SET
, then for SELECT column1 FROM
t
, the server sends back all the values for
column1
using the character set that the
client specified when it connected. On the other hand, if you
say SET NAMES 'latin1'
or SET
CHARACTER SET latin1
before issuing the
SELECT
statement, the server
converts the latin2
values to
latin1
just before sending results back.
Conversion may be lossy if there are characters that are not in
both character sets.
If you do not want the server to perform any conversion of
result sets or error messages, set
character_set_results
to
NULL
or binary
:
SET character_set_results = NULL;
To see the values of the character set and collation system variables that apply to your connection, use these statements:
SHOW VARIABLES LIKE 'character_set%'; SHOW VARIABLES LIKE 'collation%';
You must also consider the environment within which your MySQL applications execute. See Section 9.1.5, “Configuring the Character Set and Collation for Applications”.
For more information about character sets and error messages, see Section 9.1.6, “Character Set for Error Messages”.
For applications that store data using the default MySQL
character set and collation (latin1
,
latin1_swedish_ci
), no special configuration
should be needed. If applications require data storage using a
different character set or collation, you can configure
character set information several ways:
Specify character settings per database. For example, applications that use one database might require
utf8
, whereas applications that use another database might requiresjis
.Specify character settings at server startup. This causes the server to use the given settings for all applications that do not make other arrangements.
Specify character settings at configuration time, if you build MySQL from source. This causes the server to use the given settings for all applications, without having to specify them at server startup.
When different applications require different character settings, the per-database technique provides a good deal of flexibility. If most or all applications use the same character set, specifying character settings at server startup or configuration time may be most convenient.
For the per-database or server-startup techniques, the settings control the character set for data storage. Applications must also tell the server which character set to use for client/server communications, as described in the following instructions.
The examples shown here assume use of the
utf8
character set and
utf8_general_ci
collation.
Specify character settings per
database. To create a database such that its tables
will use a given default character set and collation for data
storage, use a CREATE DATABASE
statement like this:
CREATE DATABASE mydb DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
Tables created in the database will use utf8
and utf8_general_ci
by default for any
character columns.
Applications that use the database should also configure their
connection to the server each time they connect. This can be
done by executing a SET NAMES 'utf8'
statement after connecting. The statement can be used regardless
of connection method: The mysql client, PHP
scripts, and so forth.
In some cases, it may be possible to configure the connection to
use the desired character set some other way. For example, for
connections made using mysql, you can specify
the --default-character-set=utf8
command-line option to achieve the same effect as SET
NAMES 'utf8'
.
For more information about configuring client connections, see Section 9.1.4, “Connection Character Sets and Collations”.
If you change the default character set or collation for a
database, stored routines that use the database defaults must be
dropped and recreated so that they use the new defaults. (In a
stored routine, variables with character data types use the
database defaults if the character set or collation are not
specified explicitly. See Section 12.1.15, “CREATE PROCEDURE
and
CREATE FUNCTION
Синтаксис”.)
Specify character settings at server
startup. To select a character set and collation at
server startup, use the
--character-set-server
and
--collation-server
options. For
example, to specify the options in an option file, include these
lines:
[mysqld] character-set-server=utf8 collation-server=utf8_general_ci
These settings apply server-wide and apply as the defaults for databases created by any application, and for tables created in those databases.
It is still necessary for applications to configure their
connection using SET NAMES
or equivalent
after they connect, as described previously. You might be
tempted to start the server with the
--init_connect="SET NAMES 'utf8'"
option to cause SET NAMES
to be executed
automatically for each client that connects. However, this will
yield inconsistent results because the
init_connect
value is not
executed for users who have the
SUPER
privilege.
Specify character settings at MySQL
configuration time. To select a character set and
collation when you configure and build MySQL from source, use
the DEFAULT_CHARSET
and
DEFAULT_COLLATION
options for
CMake:
shell>cmake . -DDEFAULT_CHARSET=utf8 \
-DDEFAULT_COLLATION=utf8_general_ci
The resulting server uses utf8
and
utf8_general_ci
as the default for databases
and tables and for client connections. It is unnecessary to use
--character-set-server
and
--collation-server
to specify
those defaults at server startup. It is also unnecessary for
applications to configure their connection using SET
NAMES
or equivalent after they connect to the server.
Regardless of how you configure the MySQL character set for
application use, you must also consider the environment within
which those applications execute. If you will send statements
using UTF-8 text taken from a file that you create in an editor,
you should edit the file with the locale of your environment set
to UTF-8 so that the file encoding is correct and so that the
operating system handles it correctly. If you use the
mysql client from within a terminal window,
the window must be configured to use UTF-8 or characters may not
display properly. For a script that executes in a Web
environment, the script must handle character encoding properly
for its interaction with the MySQL server, and it must generate
pages that correctly indicate the encoding so that browsers know
how to display the content of the pages. For example, you can
include this <meta>
tag within your
<head>
element:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
This section describes how the server uses character sets for constructing error messages and returning them to clients. For information about the language of error messages (rather than the character set), see Section 9.2, “Setting the Error Message Language”.
As of MySQL 5.5, the server constructs error messages using
UTF-8 and returns them to clients in the character set specified
by the character_set_results
system variable.
The server constructs error messages as follows:
The message template uses UTF-8.
Parameters in the message template are replaced with values that apply to a specific error occurrence:
Identifiers such as table or column names use UTF-8 internally so they are copied as is.
Character (nonbinary) string values are converted from their character set to UTF-8.
Binary string values are copied as is for bytes in the range
0x20
to0x7E
, and using\x
hex encoding for bytes outside that range. For example, if a duplicate-key error occurs for an attempt to insert0x41CF9F
into aVARBINARY
unique column, the resulting error message uses UTF-8 with some bytes hex encoded:Duplicate entry 'A\xC3\x9F' for key 1
To return a message to the client after it has been constructed,
the server converts it from UTF-8 to the character set specified
by the character_set_results
system variable. If
character_set_results
has a
value of NULL
or binary
,
no conversion occurs. No conversion occurs if the variable value
is utf8
, either, because that matches the
original error message character set.
For characters that cannot be represented in
character_set_results
, some
encoding may occur during the conversion. The encoding uses
Unicode code point values:
Characters in the Basic Multilingual Plane (BMP) range (
0x0000
to0xFFFF
) are written using\
notation.nnnn
Characters outside the BMP range (
0x01000
to0x10FFFF
) are written using\+
notation.nnnnnn
Clients can set
character_set_results
to
control the character set in which they receive error messages.
The variable can be set directly, or indirectly by means such as
SET
NAMES
. For more information about
character_set_results
, see
Section 9.1.4, “Connection Character Sets and Collations”.
Prior to MySQL 5.5, the server constructs error messages and returns them to clients as follows:
The message template has the character set associated with the error message language. For example, English, Korean, and Russian messages use
latin1
,euckr
, andkoi8r
, respectively.Parameters in the message template are replaced with values that apply to a specific error occurrence. These parameters use their own character set. Identifiers such as table or column names use UTF-8. Data values retain their character set. For example, in the following duplicate-key message,
'
has the character set of the table column associated with key 1:xxx
'Duplicate entry '
xxx
' for key1
The preceding method of error-message construction can result in messages that contain a mix of character sets unless all items involved contain only ASCII characters.
For MySQL 5.5 and higher, the encoding that occurs during the
conversion to
character_set_results
before
returning error messages to clients can result in different
message content compared to earlier versions. For example, if an
error occurs for an attempt to drop a table named
ペ
(KATAKANA LETTER PE) and
character_set_results
is a
character set such as latin1
that does not
contain that character, the resulting message sent to the client
has an encoded table name:
ERROR 1051 (42S02): Unknown table '\30DA'
Before MySQL 5.5, the name is not encoded:
ERROR 1051 (42S02): Unknown table 'ペ'
- 9.1.7.1. Collation Names
- 9.1.7.2. Using
COLLATE
in SQL Statements - 9.1.7.3.
COLLATE
Clause Precedence - 9.1.7.4. Collations Must Be for the Right Character Set
- 9.1.7.5. Collation of Expressions
- 9.1.7.6. The
_bin
andbinary
Collations - 9.1.7.7. The
BINARY
Operator - 9.1.7.8. Examples of the Effect of Collation
- 9.1.7.9. Collation and
INFORMATION_SCHEMA
Searches
The following sections discuss various aspects of character set collations.
MySQL collation names follow these rules:
A name ending in
_ci
indicates a case-insensitive collation.A name ending in
_cs
indicates a case-sensitive collation.A name ending in
_bin
indicates a binary collation. Character comparisons are based on character binary code values.
With the COLLATE
clause, you can override
whatever the default collation is for a comparison.
COLLATE
may be used in various parts of SQL
statements. Here are some examples:
With
ORDER BY
:SELECT k FROM t1 ORDER BY k COLLATE latin1_german2_ci;
With
AS
:SELECT k COLLATE latin1_german2_ci AS k1 FROM t1 ORDER BY k1;
With
GROUP BY
:SELECT k FROM t1 GROUP BY k COLLATE latin1_german2_ci;
With aggregate functions:
SELECT MAX(k COLLATE latin1_german2_ci) FROM t1;
With
DISTINCT
:SELECT DISTINCT k COLLATE latin1_german2_ci FROM t1;
With
WHERE
:SELECT * FROM t1 WHERE _latin1 'Müller' COLLATE latin1_german2_ci = k;
SELECT * FROM t1 WHERE k LIKE _latin1 'Müller' COLLATE latin1_german2_ci;
With
HAVING
:SELECT k FROM t1 GROUP BY k HAVING k = _latin1 'Müller' COLLATE latin1_german2_ci;
The COLLATE
clause has high precedence
(higher than ||
),
so the following two expressions are equivalent:
x || y COLLATE z x || (y COLLATE z)
Each character set has one or more collations, but each
collation is associated with one and only one character set.
Therefore, the following statement causes an error message
because the latin2_bin
collation is not
legal with the latin1
character set:
mysql> SELECT _latin1 'x' COLLATE latin2_bin;
ERROR 1253 (42000): COLLATION 'latin2_bin' is not valid
for CHARACTER SET 'latin1'
In the great majority of statements, it is obvious what
collation MySQL uses to resolve a comparison operation. For
example, in the following cases, it should be clear that the
collation is the collation of column x
:
SELECT x FROM T ORDER BY x; SELECT x FROM T WHERE x = x; SELECT DISTINCT x FROM T;
However, with multiple operands, there can be ambiguity. For example:
SELECT x FROM T WHERE x = 'Y';
Should the comparison use the collation of the column
x
, or of the string literal
'Y'
? Both x
and
'Y'
have collations, so which collation
takes precedence?
Standard SQL resolves such questions using what used to be called “coercibility” rules. MySQL assigns coercibility values as follows:
An explicit
COLLATE
clause has a coercibility of 0. (Not coercible at all.)The concatenation of two strings with different collations has a coercibility of 1.
The collation of a column or a stored routine parameter or local variable has a coercibility of 2.
A “system constant” (the string returned by functions such as
USER()
orVERSION()
) has a coercibility of 3.The collation of a literal has a coercibility of 4.
NULL
or an expression that is derived fromNULL
has a coercibility of 5.
MySQL uses coercibility values with the following rules to resolve ambiguities:
Use the collation with the lowest coercibility value.
If both sides have the same coercibility, then:
If both sides are Unicode, or both sides are not Unicode, it is an error.
If one of the sides has a Unicode character set, and another side has a non-Unicode character set, the side with Unicode character set wins, and automatic character set conversion is applied to the non-Unicode side. For example, the following statement does not return an error:
SELECT CONCAT(utf8_column, latin1_column) FROM t1;
It returns a result that has a character set of
utf8
and the same collation asutf8_column
. Values oflatin1_column
are automatically converted toutf8
before concatenating.For an operation with operands from the same character set but that mix a
_bin
collation and a_ci
or_cs
collation, the_bin
collation is used. This is similar to how operations that mix nonbinary and binary strings evaluate the operands as binary strings, except that it is for collations rather than data types.
Although automatic conversion is not in the SQL standard, the SQL standard document does say that every character set is (in terms of supported characters) a “subset” of Unicode. Because it is a well-known principle that “what applies to a superset can apply to a subset,” we believe that a collation for Unicode can apply for comparisons with non-Unicode strings.
Examples:
Comparison | Collation Used |
---|---|
column1 = 'A' | Use collation of column1 |
column1 = 'A' COLLATE x | Use collation of 'A' COLLATE x |
column1 COLLATE x = 'A' COLLATE y | Error |
The COERCIBILITY()
function can
be used to determine the coercibility of a string expression:
mysql>SELECT COERCIBILITY('A' COLLATE latin1_swedish_ci);
-> 0 mysql>SELECT COERCIBILITY(VERSION());
-> 3 mysql>SELECT COERCIBILITY('A');
-> 4
See Section 11.14, “Information Functions”.
For implicit conversion of a numeric or temporal value to a
string, such as occurs for the argument 1
in the expression CONCAT(1,
'abc')
, the result is a character (nonbinary) string
that has a character set and collation determined by the
character_set_connection
and
collation_connection
system
variables. (Before MySQL 5.5.3, the result is a binary string
for which the character set and collation are
binary
). See
Section 11.2, “Type Conversion in Expression Evaluation”.
This section describes how _bin
collations
for nonbinary strings differ from the
binary
“collation” for binary
strings.
Nonbinary strings (as stored in the
CHAR
,
VARCHAR
, and
TEXT
data types) have a
character set and collation. A given character set can have
several collations, each of which defines a particular sorting
and comparison order for the characters in the set. One of
these is the binary collation for the character set, indicated
by a _bin
suffix in the collation name. For
example, latin1
and utf8
have binary collations named latin1_bin
and
utf8_bin
.
Binary strings (as stored in the
BINARY
,
VARBINARY
, and
BLOB
data types) have no
character set or collation in the sense that nonbinary strings
do. (Applied to a binary string, the
CHARSET()
and
COLLATION()
functions both return a value
of binary
.) Binary strings are sequences of
bytes and the numeric values of those bytes determine sort
order.
The _bin
collations differ from the
binary
collation in several respects.
The unit for sorting and
comparison. Binary strings are sequences of bytes.
Sorting and comparison is always based on numeric byte values.
Nonbinary strings are sequences of characters, which might be
multi-byte. Collations for nonbinary strings define an
ordering of the character values for sorting and comparison.
For the _bin
collation, this ordering is
based solely on binary code values of the characters (which is
similar to ordering for binary strings except that a
_bin
collation must take into account that
a character might contain multiple bytes). For other
collations, character ordering might take additional factors
such as lettercase into account.
Character set conversion. A
nonbinary string has a character set and is converted to
another character set in many cases, even when the string has
a _bin
collation:
When assigning column values from another column that has a different character set:
UPDATE t1 SET utf8_bin_column=latin1_column; INSERT INTO t1 (latin1_column) SELECT utf8_bin_column FROM t2;
When assigning column values for
INSERT
orUPDATE
using a string literal:SET NAMES latin1; INSERT INTO t1 (utf8_bin_column) VALUES ('string-in-latin1');
When sending results from the server to a client:
SET NAMES latin1; SELECT utf8_bin_column FROM t2;
For binary string columns, no conversion occurs. For the preceding cases, the string value is copied byte-wise.
Lettercase conversion.
Collations provide information about lettercase of characters,
so characters in a nonbinary string can be converted from one
lettercase to another, even for _bin
collations that ignore lettercase for ordering:
mysql>SET NAMES latin1 COLLATE latin1_bin;
Query OK, 0 rows affected (0.02 sec) mysql>SELECT LOWER('aA'), UPPER('zZ');
+-------------+-------------+ | LOWER('aA') | UPPER('zZ') | +-------------+-------------+ | aa | ZZ | +-------------+-------------+ 1 row in set (0.13 sec)
The concept of lettercase does not apply to bytes in a binary string. To perform lettercase conversion, the string must be converted to a nonbinary string:
mysql>SET NAMES binary;
Query OK, 0 rows affected (0.00 sec) mysql>SELECT LOWER('aA'), LOWER(CONVERT('aA' USING latin1));
+-------------+-----------------------------------+ | LOWER('aA') | LOWER(CONVERT('aA' USING latin1)) | +-------------+-----------------------------------+ | aA | aa | +-------------+-----------------------------------+ 1 row in set (0.00 sec)
Trailing space handling in
comparisons. Nonbinary strings have
PADSPACE
behavior for all collations,
including _bin
collations. Trailing spaces
are insignificant in comparisons:
mysql>SET NAMES utf8 COLLATE utf8_bin;
Query OK, 0 rows affected (0.00 sec) mysql>SELECT 'a ' = 'a';
+------------+ | 'a ' = 'a' | +------------+ | 1 | +------------+ 1 row in set (0.00 sec)
For binary strings, all characters are significant in comparisons, including trailing spaces:
mysql>SET NAMES binary;
Query OK, 0 rows affected (0.00 sec) mysql>SELECT 'a ' = 'a';
+------------+ | 'a ' = 'a' | +------------+ | 0 | +------------+ 1 row in set (0.00 sec)
Trailing space handling for inserts and
retrievals.
CHAR(
columns
store nonbinary strings. Values shorter than
N
)N
characters are extended with
spaces on insertion. For retrieval, trailing spaces are
removed.
BINARY(
columns store binary strings. Values shorter than
N
)N
bytes are extended with
0x00
bytes on insertion. For retrieval,
nothing is removed; a value of the declared length is always
returned.
mysql>CREATE TABLE t1 (
->a CHAR(10) CHARACTER SET utf8 COLLATE utf8_bin,
->b BINARY(10)
->);
Query OK, 0 rows affected (0.09 sec) mysql>INSERT INTO t1 VALUES ('a','a');
Query OK, 1 row affected (0.01 sec) mysql>SELECT HEX(a), HEX(b) FROM t1;
+--------+----------------------+ | HEX(a) | HEX(b) | +--------+----------------------+ | 61 | 61000000000000000000 | +--------+----------------------+ 1 row in set (0.04 sec)
The BINARY
operator casts the
string following it to a binary string. This is an easy way to
force a comparison to be done byte by byte rather than
character by character. BINARY
also causes trailing spaces to be significant.
mysql>SELECT 'a' = 'A';
-> 1 mysql>SELECT BINARY 'a' = 'A';
-> 0 mysql>SELECT 'a' = 'a ';
-> 1 mysql>SELECT BINARY 'a' = 'a ';
-> 0
BINARY
is
shorthand for
str
CAST(
.
str
AS
BINARY)
The BINARY
attribute in character column
definitions has a different effect. A character column defined
with the BINARY
attribute is assigned the
binary collation of the column character set. Every character
set has a binary collation. For example, the binary collation
for the latin1
character set is
latin1_bin
, so if the table default
character set is latin1
, these two column
definitions are equivalent:
CHAR(10) BINARY CHAR(10) CHARACTER SET latin1 COLLATE latin1_bin
The effect of BINARY
as a column attribute
differs from its effect prior to MySQL 4.1. Formerly,
BINARY
resulted in a column that was
treated as a binary string. A binary string is a string of
bytes that has no character set or collation, which differs
from a nonbinary character string that has a binary collation.
For both types of strings, comparisons are based on the
numeric values of the string unit, but for nonbinary strings
the unit is the character and some character sets support
multi-byte characters. Section 10.4.2, “The BINARY
and
VARBINARY
Types”.
The use of CHARACTER SET binary
in the
definition of a CHAR
,
VARCHAR
, or
TEXT
column causes the column
to be treated as a binary data type. For example, the
following pairs of definitions are equivalent:
CHAR(10) CHARACTER SET binary BINARY(10) VARCHAR(10) CHARACTER SET binary VARBINARY(10) TEXT CHARACTER SET binary BLOB
Пример 1: Sorting German Umlauts
Suppose that column X
in table
T
has these latin1
column values:
Muffler Müller MX Systems MySQL
Suppose also that the column values are retrieved using the following statement:
SELECT X FROM T ORDER BY X COLLATE collation_name
;
The following table shows the resulting order of the values if
we use ORDER BY
with different collations.
latin1_swedish_ci | latin1_german1_ci | latin1_german2_ci |
---|---|---|
Muffler | Muffler | Müller |
MX Systems | Müller | Muffler |
Müller | MX Systems | MX Systems |
MySQL | MySQL | MySQL |
The character that causes the different sort orders in this
example is the U with two dots over it
(ü
), which the Germans call
“U-umlaut.”
The first column shows the result of the
SELECT
using the Swedish/Finnish collating rule, which says that U-umlaut sorts with Y.The second column shows the result of the
SELECT
using the German DIN-1 rule, which says that U-umlaut sorts with U.The third column shows the result of the
SELECT
using the German DIN-2 rule, which says that U-umlaut sorts with UE.
Пример 2: Searching for German Umlauts
Suppose that you have three tables that differ only by the character set and collation used:
mysql>SET NAMES utf8;
mysql>CREATE TABLE german1 (
->c CHAR(10)
->) CHARACTER SET latin1 COLLATE latin1_german1_ci;
mysql>CREATE TABLE german2 (
->c CHAR(10)
->) CHARACTER SET latin1 COLLATE latin1_german2_ci;
mysql>CREATE TABLE germanutf8 (
->c CHAR(10)
->) CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Each table contains two records:
mysql>INSERT INTO german1 VALUES ('Bar'), ('Bär');
mysql>INSERT INTO german2 VALUES ('Bar'), ('Bär');
mysql>INSERT INTO germanutf8 VALUES ('Bar'), ('Bär');
Two of the above collations have an A = Ä
equality, and one has no such equality
(latin1_german2_ci
). For that reason,
you'll get these results in comparisons:
mysql>SELECT * FROM german1 WHERE c = 'Bär';
+------+ | c | +------+ | Bar | | Bär | +------+ mysql>SELECT * FROM german2 WHERE c = 'Bär';
+------+ | c | +------+ | Bär | +------+ mysql>SELECT * FROM germanutf8 WHERE c = 'Bär';
+------+ | c | +------+ | Bar | | Bär | +------+
This is not a bug but rather a consequence of the sorting
properties of latin1_german1_ci
and
utf8_unicode_ci
(the sorting shown is done
according to the German DIN 5007 standard).
String columns in INFORMATION_SCHEMA
tables
have a collation of utf8_general_ci
, which
is case insensitive. However, searches in
INFORMATION_SCHEMA
string columns are also
affected by file system case sensitivity. For values that
correspond to objects that are represented in the file system,
such as names of databases and tables, searches may be case
sensitive if the file system is case sensitive. This section
describes how to work around this issue if necessary; see also
Bug #34921.
Suppose that a query searches the
SCHEMATA.SCHEMA_NAME
column for the
test
database. On Linux, file systems are
case sensitive, so comparisons of
SCHEMATA.SCHEMA_NAME
with
'test'
match, but comparisons with
'TEST'
do not:
mysql>SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA
->WHERE SCHEMA_NAME = 'test';
+-------------+ | SCHEMA_NAME | +-------------+ | test | +-------------+ 1 row in set (0.01 sec) mysql>SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA
->WHERE SCHEMA_NAME = 'TEST';
Empty set (0.00 sec)
On Windows or Mac OS X where file systems are not case
sensitive, comparisons match both 'test'
and 'TEST'
:
mysql>SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA
->WHERE SCHEMA_NAME = 'test';
+-------------+ | SCHEMA_NAME | +-------------+ | test | +-------------+ 1 row in set (0.00 sec) mysql>SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA
->WHERE SCHEMA_NAME = 'TEST';
+-------------+ | SCHEMA_NAME | +-------------+ | TEST | +-------------+ 1 row in set (0.00 sec)
The value of the
lower_case_table_names
system
variable makes no difference in this context.
This behavior occurs because the
utf8_general_ci
collation is not used for
INFORMATION_SCHEMA
queries when searching
the file system for database objects. It is a result of
optimizations implemented for
INFORMATION_SCHEMA
searches in MySQL. For
information about these optimizations, see
Section 7.2.4, “Optimizing INFORMATION_SCHEMA
Queries”.
Searches in INFORMATION_SCHEMA
string
columns for values that refer to
INFORMATION_SCHEMA
itself do use the
utf8_general_ci
collation because
INFORMATION_SCHEMA
is a
“virtual” database and is not represented in the
file system. For example, comparisons with
SCHEMATA.SCHEMA_NAME
match
'information_schema'
or
'INFORMATION_SCHEMA'
regardless of
platform:
mysql>SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA
->WHERE SCHEMA_NAME = 'information_schema';
+--------------------+ | SCHEMA_NAME | +--------------------+ | information_schema | +--------------------+ 1 row in set (0.00 sec) mysql>SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA
->WHERE SCHEMA_NAME = 'INFORMATION_SCHEMA';
+--------------------+ | SCHEMA_NAME | +--------------------+ | information_schema | +--------------------+ 1 row in set (0.00 sec)
If the result of a string operation on an
INFORMATION_SCHEMA
column differs from
expectations, a workaround is to use an explicit
COLLATE
clause to force a suitable
collation (Section 9.1.7.2, “Using COLLATE
in SQL Statements”). For example, to
perform a case-insensitive search, use
COLLATE
with the
INFORMATION_SCHEMA
column name:
mysql>SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA
->WHERE SCHEMA_NAME COLLATE utf8_general_ci = 'test';
+-------------+ | SCHEMA_NAME | +-------------+ | test | +-------------+ 1 row in set (0.00 sec) mysql>SELECT SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA
->WHERE SCHEMA_NAME COLLATE utf8_general_ci = 'TEST';
| SCHEMA_NAME | +-------------+ | test | +-------------+ 1 row in set (0.00 sec)
You can also use the UPPER()
or
LOWER()
function:
WHERE UPPER(SCHEMA_NAME) = 'TEST' WHERE LOWER(SCHEMA_NAME) = 'test'
Although a case-insensitive comparison can be performed even
on platforms with case-sensitive file systems, as just shown,
it is not necessarily always the right thing to do. On such
platforms, it is possible to have multiple objects with names
that differ only in lettercase. For example, tables named
city
, CITY
, and
City
can all exist simultaneously. Consider
whether a search should match all such names or just one and
write queries accordingly:
WHERE TABLE_NAME COLLATE utf8_bin = 'City' WHERE TABLE_NAME COLLATE utf8_general_ci = 'city' WHERE UPPER(TABLE_NAME) = 'CITY' WHERE LOWER(TABLE_NAME) = 'city'
The first of those comparisons (with
utf8_bin
) is case sensitive; the others are
not.
The repertoire of a character set is the collection of characters in the set.
String expressions have a repertoire attribute, which can have two values:
ASCII
: The expression can contain only characters in the Unicode rangeU+0000
toU+007F
.UNICODE
: The expression can contain characters in the Unicode rangeU+0000
toU+FFFF
.
The ASCII
range is a subset of
UNICODE
range, so a string with
ASCII
repertoire can be converted safely
without loss of information to the character set of any string
with UNICODE
repertoire or to a character set
that is a superset of ASCII
. (All MySQL
character sets are supersets of ASCII
with
the exception of swe7
, which reuses some
punctuation characters for Swedish accented characters.) The use
of repertoire enables character set conversion in expressions
for many cases where MySQL would otherwise return an
“illegal mix of collations” error.
The following discussion provides examples of expressions and their repertoires, and describes how the use of repertoire changes string expression evaluation:
The repertoire for string constants depends on string content:
SET NAMES utf8; SELECT 'abc'; SELECT _utf8'def'; SELECT N'MySQL';
Although the character set is
utf8
in each of the preceding cases, the strings do not actually contain any characters outside the ASCII range, so their repertoire isASCII
rather thanUNICODE
.Columns having the
ascii
character set haveASCII
repertoire because of their character set. In the following table,c1
hasASCII
repertoire:CREATE TABLE t1 (c1 CHAR(1) CHARACTER SET ascii);
The following example illustrates how repertoire enables a result to be determined in a case where an error occurs without repertoire:
CREATE TABLE t1 ( c1 CHAR(1) CHARACTER SET latin1, c2 CHAR(1) CHARACTER SET ascii ); INSERT INTO t1 VALUES ('a','b'); SELECT CONCAT(c1,c2) FROM t1;
Without repertoire, this error occurs:
ERROR 1267 (HY000): Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (ascii_general_ci,IMPLICIT) for operation 'concat'
Using repertoire, subset to superset (
ascii
tolatin1
) conversion can occur and a result is returned:+---------------+ | CONCAT(c1,c2) | +---------------+ | ab | +---------------+
Functions with one string argument inherit the repertoire of their argument. The result of
UPPER(_utf8'
hasabc
')ASCII
repertoire because its argument hasASCII
repertoire.For functions that return a string but do not have string arguments and use
character_set_connection
as the result character set, the result repertoire isASCII
ifcharacter_set_connection
isascii
, andUNICODE
otherwise:FORMAT(
numeric_column
, 4);Use of repertoire changes how MySQL evaluates the following example:
SET NAMES ascii; CREATE TABLE t1 (a INT, b VARCHAR(10) CHARACTER SET latin1); INSERT INTO t1 VALUES (1,'b'); SELECT CONCAT(FORMAT(a, 4), b) FROM t1;
Without repertoire, this error occurs:
ERROR 1267 (HY000): Illegal mix of collations (ascii_general_ci,COERCIBLE) and (latin1_swedish_ci,IMPLICIT) for operation 'concat'
With repertoire, a result is returned:
+-------------------------+ | CONCAT(FORMAT(a, 4), b) | +-------------------------+ | 1.0000b | +-------------------------+
Functions with two or more string arguments use the “widest” argument repertoire for the result repertoire (
UNICODE
is wider thanASCII
). Consider the followingCONCAT()
calls:CONCAT(_ucs2 0x0041, _ucs2 0x0042) CONCAT(_ucs2 0x0041, _ucs2 0x00C2)
For the first call, the repertoire is
ASCII
because both arguments are within the range of theascii
character set. For the second call, the repertoire isUNICODE
because the second argument is outside theascii
character set range.The repertoire for function return values is determined based only on the repertoire of the arguments that affect the result's character set and collation.
IF(column1 < column2, 'smaller', 'greater')
The result repertoire is
ASCII
because the two string arguments (the second argument and the third argument) both haveASCII
repertoire. The first argument does not matter for the result repertoire, even if the expression uses string values.
This section describes operations that take character set information into account.
MySQL has many operators and functions that return a string. This section answers the question: What is the character set and collation of such a string?
For simple functions that take string input and return a
string result as output, the output's character set and
collation are the same as those of the principal input value.
For example,
UPPER(
returns a string whose character string and collation are the
same as that of X
)X
. The same applies
for INSTR()
,
LCASE()
,
LOWER()
,
LTRIM()
,
MID()
,
REPEAT()
,
REPLACE()
,
REVERSE()
,
RIGHT()
,
RPAD()
,
RTRIM()
,
SOUNDEX()
,
SUBSTRING()
,
TRIM()
,
UCASE()
, and
UPPER()
.
Note: The REPLACE()
function,
unlike all other functions, always ignores the collation of
the string input and performs a case-sensitive comparison.
If a string input or function result is a binary string, the
string has no character set or collation. This can be checked
by using the CHARSET()
and
COLLATION()
functions, both of
which return binary
to indicate that their
argument is a binary string:
mysql> SELECT CHARSET(BINARY 'a'), COLLATION(BINARY 'a');
+---------------------+-----------------------+
| CHARSET(BINARY 'a') | COLLATION(BINARY 'a') |
+---------------------+-----------------------+
| binary | binary |
+---------------------+-----------------------+
For operations that combine multiple string inputs and return a single string output, the “aggregation rules” of standard SQL apply for determining the collation of the result:
If an explicit
COLLATE
occurs, useX
X
.If explicit
COLLATE
andX
COLLATE
occur, raise an error.Y
Otherwise, if all collations are
X
, useX
.Otherwise, the result has no collation.
For example, with CASE ... WHEN a THEN b WHEN b THEN
c COLLATE
, the
resulting collation is X
ENDX
. The same
applies for UNION
,
||
,
CONCAT()
,
ELT()
,
GREATEST()
,
IF()
, and
LEAST()
.
For operations that convert to character data, the character
set and collation of the strings that result from the
operations are defined by the
character_set_connection
and
collation_connection
system
variables. This applies only to
CAST()
,
CONV()
,
FORMAT()
,
HEX()
, and
SPACE()
.
If you are uncertain about the character set or collation of
the result returned by a string function, you can use the
CHARSET()
or
COLLATION()
function to find
out:
mysql> SELECT USER(), CHARSET(USER()), COLLATION(USER());
+----------------+-----------------+-------------------+
| USER() | CHARSET(USER()) | COLLATION(USER()) |
+----------------+-----------------+-------------------+
| test@localhost | utf8 | utf8_general_ci |
+----------------+-----------------+-------------------+
CONVERT()
provides a way to
convert data between different character sets. The syntax is:
CONVERT(expr
USINGtranscoding_name
)
In MySQL, transcoding names are the same as the corresponding character set names.
Examples:
SELECT CONVERT(_latin1'Müller' USING utf8); INSERT INTO utf8table (utf8column) SELECT CONVERT(latin1field USING utf8) FROM latin1table;
CONVERT(... USING ...)
is
implemented according to the standard SQL specification.
You may also use CAST()
to
convert a string to a different character set. The syntax is:
CAST(character_string
AScharacter_data_type
CHARACTER SETcharset_name
)
Пример:
SELECT CAST(_latin1'test' AS CHAR CHARACTER SET utf8);
If you use CAST()
without
specifying CHARACTER SET
, the resulting
character set and collation are defined by the
character_set_connection
and
collation_connection
system
variables. If you use CAST()
with CHARACTER SET X
, the resulting
character set and collation are X
and the
default collation of X
.
You may not use a COLLATE
clause inside a
CONVERT()
or
CAST()
call, but you may use it
outside. For example, CAST(... COLLATE
...)
is illegal, but CAST(...)
COLLATE ...
is legal:
SELECT CAST(_latin1'test' AS CHAR CHARACTER SET utf8) COLLATE utf8_bin;
Several SHOW
statements provide
additional character set information. These include
SHOW CHARACTER SET
,
SHOW COLLATION
,
SHOW CREATE DATABASE
,
SHOW CREATE TABLE
and
SHOW COLUMNS
. These statements
are described here briefly. For more information, see
Section 12.7.5, “SHOW
Синтаксис”.
INFORMATION_SCHEMA
has several tables that
contain information similar to that displayed by the
SHOW
statements. For example,
the CHARACTER_SETS
and
COLLATIONS
tables contain the
information displayed by SHOW CHARACTER
SET
and SHOW
COLLATION
. See Глава 19, INFORMATION_SCHEMA
Tables.
The SHOW CHARACTER SET
statement shows all available character sets. It takes an
optional LIKE
clause that
indicates which character set names to match. For example:
mysql> SHOW CHARACTER SET LIKE 'latin%';
+---------+-----------------------------+-------------------+--------+
| Charset | Описание | Default collation | Maxlen |
+---------+-----------------------------+-------------------+--------+
| latin1 | cp1252 West European | latin1_swedish_ci | 1 |
| latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 |
| latin7 | ISO 8859-13 Baltic | latin7_general_ci | 1 |
+---------+-----------------------------+-------------------+--------+
The output from SHOW COLLATION
includes all available character sets. It takes an optional
LIKE
clause that indicates which
collation names to match. For example:
mysql> SHOW COLLATION LIKE 'latin1%';
+-------------------+---------+----+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+-------------------+---------+----+---------+----------+---------+
| latin1_german1_ci | latin1 | 5 | | | 0 |
| latin1_swedish_ci | latin1 | 8 | Yes | Yes | 0 |
| latin1_danish_ci | latin1 | 15 | | | 0 |
| latin1_german2_ci | latin1 | 31 | | Yes | 2 |
| latin1_bin | latin1 | 47 | | Yes | 0 |
| latin1_general_ci | latin1 | 48 | | | 0 |
| latin1_general_cs | latin1 | 49 | | | 0 |
| latin1_spanish_ci | latin1 | 94 | | | 0 |
+-------------------+---------+----+---------+----------+---------+
SHOW CREATE DATABASE
displays
the CREATE DATABASE
statement
that creates a given database:
mysql> SHOW CREATE DATABASE test;
+----------+-----------------------------------------------------------------+
| Database | Create Database |
+----------+-----------------------------------------------------------------+
| test | CREATE DATABASE `test` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+-----------------------------------------------------------------+
If no COLLATE
clause is shown, the default
collation for the character set applies.
SHOW CREATE TABLE
is similar,
but displays the CREATE TABLE
statement to create a given table. The column definitions
indicate any character set specifications, and the table
options include character set information.
The SHOW COLUMNS
statement
displays the collations of a table's columns when invoked as
SHOW FULL
COLUMNS
. Columns with
CHAR
,
VARCHAR
, or
TEXT
data types have
collations. Numeric and other noncharacter types have no
collation (indicated by NULL
as the
Collation
value). For example:
mysql> SHOW FULL COLUMNS FROM person\G
*************************** 1. row ***************************
Field: id
Type: smallint(5) unsigned
Collation: NULL
Null: NO
Key: PRI
Default: NULL
Extra: auto_increment
Privileges: select,insert,update,references
Comment:
*************************** 2. row ***************************
Field: name
Type: char(60)
Collation: latin1_swedish_ci
Null: NO
Key:
Default:
Extra:
Privileges: select,insert,update,references
Comment:
The character set is not part of the display but is implied by the collation name.
- 9.1.10.1. The
ucs2
Character Set (UCS-2 Unicode Encoding) - 9.1.10.2. The
utf16
Character Set (UTF-16 Unicode Encoding) - 9.1.10.3. The
utf32
Character Set (UTF-32 Unicode Encoding) - 9.1.10.4. The
utf8
Character Set (Three-Byte UTF-8 Unicode Encoding) - 9.1.10.5. The
utf8mb3
“Character Set” (Alias forutf8
) - 9.1.10.6. The
utf8mb4
Character Set (Four-Byte UTF-8 Unicode Encoding)
The initial implementation of Unicode support (in MySQL 4.1) included two character sets for storing Unicode data:
ucs2
, the UCS-2 encoding of the Unicode character set using 16 bits per character.utf8
, a UTF-8 encoding of the Unicode character set using one to three bytes per character.
These two character sets support the characters from the Basic Multilingual Plane (BMP) of Unicode Version 3.0. BMP characters have these characteristics:
Their code values are between 0 and 65535 (or
U+0000
..U+FFFF
).They can be encoded with a fixed 16-bit word, as in
ucs2
.They can be encoded with 8, 16, or 24 bits, as in
utf8
.They are sufficient for almost all characters in major languages.
Characters not supported by the aforementioned character sets
include supplementary characters that lie outside the BMP.
Characters outside the BMP compare as REPLACEMENT CHARACTER and
convert to '?'
when converted to a Unicode
character set.
As of MySQL 5.5.3, Unicode support includes supplementary characters, which requires new character sets that have a broader range and therefore take more space. The following table shows a brief feature comparison of previous and current Unicode support.
Before MySQL 5.5 | MySQL 5.5 and up |
---|---|
All Unicode 3.0 characters | All Unicode 5.0 characters |
No supplementary characters | With supplementary characters |
ucs2 character set, BMP only | No change |
utf8 character set for up to three bytes, BMP only | No change |
New utf8mb4 character set for up to four bytes, BMP
or supplemental | |
New utf16 character set, BMP or supplemental | |
New utf32 character set, BMP or supplemental |
These changes are upward compatible. If you want to use the new
character sets, there are potential incompatibility issues for
your applications; see
Section 9.1.11, “Upgrading from Previous to Current Unicode Support”. That section also
describes how to convert tables from utf8
to
the (four-byte) utf8mb4
character set, and
what constraints may apply in doing so.
MySQL 5.5 supports these Unicode character sets:
ucs2
, the UCS-2 encoding of the Unicode character set using 16 bits per character.utf16
, the UTF-16 encoding for the Unicode character set; likeucs2
but with an extension for supplementary characters.utf32
, the UTF-32 encoding for the Unicode character set using 32 bits per character.utf8
, a UTF-8 encoding of the Unicode character set using one to three bytes per character.utf8mb4
, a UTF-8 encoding of the Unicode character set using one to four bytes per character.
ucs2
and utf8
support BMP
characters. utf8mb4
,
utf16
, and utf32
support
BMP and supplementary characters.
A similar set of collations is available for each Unicode
character set. For example, each has a Danish collation, the
names of which are ucs2_danish_ci
,
utf16_danish_ci
,
utf32_danish_ci
,
utf8_danish_ci
, and
utf8mb4_danish_ci
. All Unicode collations are
listed at Section 9.1.14.1, “Unicode Character Sets”, which also
describes collation properties for supplementary characters.
Note that although many of the supplementary characters come from East Asian languages, what MySQL 5.5 adds is support for more Japanese and Chinese characters in Unicode character sets, not support for new Japanese and Chinese character sets.
The MySQL implementation of UCS-2, UTF-16, and UTF-32 stores characters in big-endian byte order and does not use a byte order mark (BOM) at the beginning of values. Other database systems might use little-endian byte order or a BOM. In such cases, conversion of values will need to be performed when transferring data between those systems and MySQL.
MySQL uses no BOM for UTF-8 values.
Client applications that need to communicate with the server
using Unicode should set the client character set accordingly;
for example, by issuing a SET NAMES 'utf8'
statement. ucs2
, utf16
,
and utf32
cannot be used as a client
character set, which means that they do not work for
SET NAMES
or SET CHARACTER
SET
. (See Section 9.1.4, “Connection Character Sets and Collations”.)
The following sections provide additional detail on the Unicode character sets in MySQL.
In UCS-2, every character is represented by a two-byte Unicode
code with the most significant byte first. For example:
LATIN CAPITAL LETTER A
has the code
0x0041
and it is stored as a two-byte
sequence: 0x00 0x41
. CYRILLIC
SMALL LETTER YERU
(Unicode
0x044B
) is stored as a two-byte sequence:
0x04 0x4B
. For Unicode characters and their
codes, please refer to the
Unicode Home
Page.
In MySQL, the ucs2
character set is a
fixed-length 16-bit encoding for Unicode BMP characters.
The utf16
character set is the
ucs2
character set with an extension that
enables encoding of supplementary characters:
For a BMP character,
utf16
anducs2
have identical storage characteristics: same code values, same encoding, same length.For a supplementary character,
utf16
has a special sequence for representing the character using 32 bits. This is called the “surrogate” mechanism: For a number greater than0xffff
, take 10 bits and add them to0xd800
and put them in the first 16-bit word, take 10 more bits and add them to0xdc00
and put them in the next 16-bit word. Consequently, all supplementary characters require 32 bits, where the first 16 bits are a number between0xd800
and0xdbff
, and the last 16 bits are a number between0xdc00
and0xdfff
. Examples are in Section 15.5 Surrogates Area of the Unicode 4.0 document.
Because utf16
supports surrogates and
ucs2
does not, there is a validity check
that applies only in utf16
: You cannot
insert a top surrogate without a bottom surrogate, or vice
versa. For example:
INSERT INTO t (ucs2_column) VALUES (0xd800); /* legal */ INSERT INTO t (utf16_column)VALUES (0xd800); /* illegal */
There is no validity check for characters that are technically
valid but are not true Unicode (that is, characters that
Unicode considers to be “unassigned code points”
or “private use” characters or even
“illegals” like 0xffff
). For
example, since U+F8FF
is the Apple Logo,
this is legal:
INSERT INTO t (utf16_column)VALUES (0xf8ff); /* legal */
Such characters cannot be expected to mean the same thing to everyone.
Because MySQL must allow for the worst case (that one
character requires four bytes) the maximum length of a
utf16
column or index is only half of the
maximum length for a ucs2
column or index.
For example, in MySQL 5.5, the maximum length of
a MEMORY
table index key is 3072 bytes, so
these statements create tables with the longest permitted
indexes for ucs2
and
utf16
columns:
CREATE TABLE tf (s1 VARCHAR(1536) CHARACTER SET ucs2) ENGINE=MEMORY; CREATE INDEX i ON tf (s1); CREATE TABLE tg (s1 VARCHAR(768) CHARACTER SET utf16) ENGINE=MEMORY; CREATE INDEX i ON tg (s1);
The utf32
character set is fixed length
(like ucs2
and unlike
utf16
). utf32
uses 32
bits for every character, unlike ucs2
(which uses 16 bits for every character), and unlike
utf16
(which uses 16 bits for some
characters and 32 bits for others).
utf32
takes twice as much space as
ucs2
and more space than
utf16
, but utf32
has the
same advantage as ucs2
that it is
predictable for storage: The required number of bytes for
utf32
equals the number of characters times
4. Also, unlike utf16
, there are no tricks
for encoding in utf32
, so the stored value
equals the code value.
To demonstrate how the latter advantage is useful, here is an
example that shows how to determine a
utf8mb4
value given the
utf32
code value:
/* Assume code value = 100cc LINEAR B WHEELED CHARIOT */ CREATE TABLE tmp (utf32_col CHAR(1) CHARACTER SET utf32, utf8mb4_col CHAR(1) CHARACTER SET utf8mb4); INSERT INTO tmp VALUES (0x000100cc,NULL); UPDATE tmp SET utf8mb4_col = utf32_col; SELECT HEX(utf32_col),HEX(utf8mb4_col) FROM tmp;
MySQL is very forgiving about additions of unassigned Unicode
characters or private-use-area characters. There is in fact
only one validity check for utf32
: No code
value may be greater than 0x10ffff
. For
example, this is illegal:
INSERT INTO t (utf32_column) VALUES (0x110000); /* illegal */
UTF-8 (Unicode Transformation Format with 8-bit units) is an alternative way to store Unicode data. It is implemented according to RFC 3629, which describes encoding sequences that take from one to four bytes. (An older standard for UTF-8 encoding, RFC 2279, describes UTF-8 sequences that take from one to six bytes. RFC 3629 renders RFC 2279 obsolete; for this reason, sequences with five and six bytes are no longer used.)
The idea of UTF-8 is that various Unicode characters are encoded using byte sequences of different lengths:
Basic Latin letters, digits, and punctuation signs use one byte.
Most European and Middle East script letters fit into a two-byte sequence: extended Latin letters (with tilde, macron, acute, grave and other accents), Cyrillic, Greek, Armenian, Hebrew, Arabic, Syriac, and others.
Korean, Chinese, and Japanese ideographs use three-byte or four-byte sequences.
The utf8
character set is the same in MySQL
5.5 as before 5.5 and has exactly
the same characteristics:
No support for supplementary characters (BMP characters only).
A maximum of three bytes per multi-byte character.
Exactly the same set of characters is available in
utf8
as in ucs2
. That
is, they have the same repertoire.
In a future version of MySQL, it is possible that
utf8
will become the 4-byte
utf8
, and that users who want to indicate
3-byte utf8
will have to say
utf8mb3
. To avoid some future problems
which might occur with replication when master and slave
servers have different MySQL versions, it is possible as of
MySQL 5.5.3 for users to specify utf8mb3
in
CHARACTER SET
clauses, and
utf8mb3_
in collation_substring
COLLATE
clauses, where
collation_substring
is
bin
, czech_ci
,
danish_ci
, esperanto_ci
,
estonian_ci
, and so forth. For example:
CREATE TABLE t (s1 CHAR(1) CHARACTER SET utf8mb3; SELECT * FROM t WHERE s1 COLLATE utf8mb3_general_ci = 'x'; DECLARE x VARCHAR(5) CHARACTER SET utf8mb3 COLLATE utf8mb3_danish_ci; SELECT CAST('a' AS CHAR CHARACTER SET utf8) COLLATE utf8_czech_ci;
MySQL immediately converts instances of
utf8mb3
in an alias to
utf8
, so in statements such as
SHOW CREATE TABLE
or SELECT
CHARACTER_SET_NAME FROM INFORMATION_SCHEMA.COLUMNS
or SELECT COLLATION_NAME FROM
INFORMATION_SCHEMA.COLUMNS
, users will see the true
name, utf8
or
utf8_
.
collation_substring
The utf8mb3
alias is valid only in
CHARACTER SET
clauses, and in certain other
places. For example, these are legal:
mysqld --character-set-server=utf8mb3 SET NAMES 'utf8mb3'; /* and other SET statements that have similar effect */ SELECT _utf8mb3 'a';
The character set named utf8
uses a maximum
of three bytes per character and contains only BMP characters.
As of MySQL 5.5.3, the utf8mb4
character
set uses a maximum of four bytes per character supports
supplemental characters:
For a BMP character,
utf8
andutf8mb4
have identical storage characteristics: same code values, same encoding, same length.For a supplementary character,
utf8
cannot store the character at all, whileutf8mb4
requires four bytes to store it. Sinceutf8
cannot store the character at all, you do not have any supplementary characters inutf8
columns and you need not worry about converting characters or losing data when upgradingutf8
data from older versions of MySQL.
utf8mb4
is a superset of
utf8
, so for an operation such as the
following concatenation, the result has character set
utf8mb4
and the collation of
utf8mb4_col
:
SELECT CONCAT(utf8_col, utf8mb4_col);
Similarly, the following comparison in the
WHERE
clause works according to the
collation of utf8mb_col
:
SELECT * FROM utf8_tbl, utf8mb4_tbl WHERE utf8_tbl.utf8_col = utf8mb4_tbl.utf8mb4_col;
To save space with UTF-8, use
VARCHAR
instead of
CHAR
. Otherwise, MySQL must
reserve three (or four) bytes for each character in a
CHAR CHARACTER SET utf8
(or
utf8mb4
) column because that is the
maximum possible length. For example, MySQL must reserve 40
bytes for a CHAR(10) CHARACTER SET
utf8mb4
column.
This section describes issues pertaining to Unicode support that you may face when upgrading to MySQL 5.5 from an older MySQL release. It also provides guidelines for downgrading from MySQL 5.5 to an older release.
In most respects, upgrading to MySQL 5.5 should present few problems with regard to Unicode usage, although there are some potential areas of incompatibility. These are the primary areas of concern:
For the variable-length character data types (
VARCHAR
and theTEXT
types), the maximum length in characters is less forutf8mb4
columns than forutf8
columns.For all character data types (
CHAR
,VARCHAR
, and theTEXT
types), the maximum number of characters that can be indexed is less forutf8mb4
columns than forutf8
columns.
Consequently, if you want to upgrade tables from
utf8
to utf8mb4
to take
advantage of supplementary-character support, it may be
necessary to change some column or index definitions.
Tables can be converted from utf8
to
utf8mb4
by using ALTER
TABLE
. Suppose that a table was originally defined as
follows:
CREATE TABLE t1 ( col1 CHAR(10) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL, col2 CHAR(10) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL ) CHARACTER SET utf8;
The following statement converts t1
to use
utf8mb4
:
ALTER TABLE t1 DEFAULT CHARACTER SET utf8mb4, MODIFY col1 CHAR(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL, MODIFY col2 CHAR(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL;
In terms of table content, conversion from
utf8
to utf8mb4
presents
no problems:
For a BMP character,
utf8
andutf8mb4
have identical storage characteristics: same code values, same encoding, same length.For a supplementary character,
utf8
cannot store the character at all, whileutf8mb4
requires four bytes to store it. Sinceutf8
cannot store the character at all, you do not have any supplementary characters inutf8
columns and you need not worry about converting characters or losing data when upgradingutf8
data from older versions of MySQL.
In terms of table structure, the catch when converting from
utf8
to utf8mb4
is that
the maximum length of a column or index key is unchanged in
terms of bytes. Therefore, it is smaller in
terms of characters because the maximum
length of a character is four bytes instead of three. For the
CHAR
,
VARCHAR
, and
TEXT
data types, watch for these
things when converting your MySQL tables:
Check all definitions of
utf8
columns and make sure they will not exceed the maximum length for the storage engine.Check all indexes on
utf8
columns and make sure they will not exceed the maximum length for the storage engine. Sometimes the maximum can change due to storage engine enhancements.
If the preceding conditions apply, you must either reduce the
defined length of columns or indexes, or continue to use
utf8
rather than utf8mb4
.
Here are some examples where structural changes may be needed:
A
TINYTEXT
column can hold up to 255 bytes, so it can hold up to 85 three-byte or 63 four-byte characters. Suppose that you have aTINYTEXT
column that usesutf8
but must be able to contain more than 63 characters. You cannot convert it toutf8mb4
unless you also change the data type to a longer type such asTEXT
.Similarly, a very long
VARCHAR
column may need to be changed to one of the longerTEXT
types if you want to convert it fromutf8
toutf8mb4
.InnoDB
has a maximum index length of 767 bytes, so forutf8
orutf8mb4
columns, you can index a maximum of 255 or 191 characters, respectively. If you currently haveutf8
columns with indexes longer than 191 characters, you will need to index a smaller number of characters. In anInnoDB
table, these column and index definitions are legal:col1 VARCHAR(500) CHARACTER SET utf8, INDEX (col1(255))
To use
utf8mb4
instead, the index must be smaller:col1 VARCHAR(500) CHARACTER SET utf8mb4, INDEX (col1(191))
The preceding types of changes are most likely to be required
only if you have very long columns or indexes. Otherwise, you
should be able to convert your tables from
utf8
to utf8mb4
without
problems. You can do this by using ALTER
TABLE
as described earlier in this section after
upgrading in place to 5.5.
The following items summarize other potential areas of incompatibility:
Performance of four-byte UTF-8 (
utf8mb4
) is slower than for three-byte UTF-8 (utf8
). If you do not want to incur this penalty, continue to useutf8
.SET NAMES 'utf8mb4'
causes use of the four-byte character set for connection character sets. As long as no four-byte characters are sent from the server, there should be no problems. Otherwise, applications that expect to receive a maximum of three bytes per character may have problems. Conversely, applications that expect to send four-byte characters must ensure that the server understands them.Applications cannot send
utf16
orutf32
character data to an older server that does not understand them.For replication, if the character sets that support supplementary characters are going to be used on the master, all slaves must understand them as well. If you attempt to replicate from a MySQL 5.5 master to an older slave,
utf8
data will be seen asutf8
by the slave and should replicate correctly. But you cannot sendutf8mb4
,utf16
, orutf32
data.Also, keep in mind the general principle that if a table has different definitions on the master and slave, this can lead to unexpected results. For example, the differences in limitations on index key length makes it risky to use
utf8
on the master andutf8mb4
on the slave.
If you have upgraded to MySQL 5.5, and then decide to downgrade back to an older release, these considerations apply:
ucs2
andutf8
data should present no problems.Any definitions that refer to the
utf8mb4
,utf16
, orutf32
character sets will not be recognized by the older server.For object definitions that refer to the
utf8mb4
character set, you can dump them with mysqldump in MySQL 5.5, edit the dump file to change instances ofutf8mb4
toutf8
, and reload the file in the older server, as long as there are no four-byte characters in the data. The older server will seeutf8
in the dump file object definitions and create new objects that use the (three-byte)utf8
character set.
Metadata is “the data about the
data.” Anything that describes the
database—as opposed to being the
contents of the database—is metadata.
Thus column names, database names, user names, version names,
and most of the string results from
SHOW
are metadata. This is also
true of the contents of tables in
INFORMATION_SCHEMA
because those tables by
definition contain information about database objects.
Representation of metadata must satisfy these requirements:
All metadata must be in the same character set. Otherwise, neither the
SHOW
statements norSELECT
statements for tables inINFORMATION_SCHEMA
would work properly because different rows in the same column of the results of these operations would be in different character sets.Metadata must include all characters in all languages. Otherwise, users would not be able to name columns and tables using their own languages.
To satisfy both requirements, MySQL stores metadata in a Unicode character set, namely UTF-8. This does not cause any disruption if you never use accented or non-Latin characters. But if you do, you should be aware that metadata is in UTF-8.
The metadata requirements mean that the return values of the
USER()
,
CURRENT_USER()
,
SESSION_USER()
,
SYSTEM_USER()
,
DATABASE()
, and
VERSION()
functions have the
UTF-8 character set by default.
The server sets the
character_set_system
system
variable to the name of the metadata character set:
mysql> SHOW VARIABLES LIKE 'character_set_system';
+----------------------+-------+
| Variable_name | Value |
+----------------------+-------+
| character_set_system | utf8 |
+----------------------+-------+
Storage of metadata using Unicode does not
mean that the server returns headers of columns and the results
of DESCRIBE
functions in the
character_set_system
character
set by default. When you use SELECT column1 FROM
t
, the name column1
itself is
returned from the server to the client in the character set
determined by the value of the
character_set_results
system
variable, which has a default value of
latin1
. If you want the server to pass
metadata results back in a different character set, use the
SET NAMES
statement to force the server to
perform character set conversion. SET NAMES
sets the character_set_results
and other related system variables. (See
Section 9.1.4, “Connection Character Sets and Collations”.) Alternatively, a client
program can perform the conversion after receiving the result
from the server. It is more efficient for the client to perform
the conversion, but this option is not always available for all
clients.
If character_set_results
is set
to NULL
, no conversion is performed and the
server returns metadata using its original character set (the
set indicated by
character_set_system
).
Error messages returned from the server to the client are converted to the client character set automatically, as with metadata.
If you are using (for example) the
USER()
function for comparison or
assignment within a single statement, don't worry. MySQL
performs some automatic conversion for you.
SELECT * FROM t1 WHERE USER() = latin1_column;
This works because the contents of
latin1_column
are automatically converted to
UTF-8 before the comparison.
INSERT INTO t1 (latin1_column) SELECT USER();
This works because the contents of
USER()
are automatically
converted to latin1
before the assignment.
Although automatic conversion is not in the SQL standard, the SQL standard document does say that every character set is (in terms of supported characters) a “subset” of Unicode. Because it is a well-known principle that “what applies to a superset can apply to a subset,” we believe that a collation for Unicode can apply for comparisons with non-Unicode strings. For more information about coercion of strings, see Section 9.1.7.5, “Collation of Expressions”.
To convert a binary or nonbinary string column to use a
particular character set, use ALTER
TABLE
. For successful conversion to occur, one of the
following conditions must apply:
If the column has a binary data type (
BINARY
,VARBINARY
,BLOB
), all the values that it contains must be encoded using a single character set (the character set you're converting the column to). If you use a binary column to store information in multiple character sets, MySQL has no way to know which values use which character set and cannot convert the data properly.If the column has a nonbinary data type (
CHAR
,VARCHAR
,TEXT
), its contents should be encoded in the column character set, not some other character set. If the contents are encoded in a different character set, you can convert the column to use a binary data type first, and then to a nonbinary column with the desired character set.
Suppose that a table t
has a binary column
named col1
defined as
VARBINARY(50)
. Assuming that the information
in the column is encoded using a single character set, you can
convert it to a nonbinary column that has that character set.
For example, if col1
contains binary data
representing characters in the greek
character set, you can convert it as follows:
ALTER TABLE t MODIFY col1 VARCHAR(50) CHARACTER SET greek;
If your original column has a type of
BINARY(50)
, you could convert it to
CHAR(50)
, but the resulting values will be
padded with 0x00
bytes at the end, which may
be undesirable. To remove these bytes, use the
TRIM()
function:
UPDATE t SET col1 = TRIM(TRAILING 0x00 FROM col1);
Suppose that table t
has a nonbinary column
named col1
defined as CHAR(50)
CHARACTER SET latin1
but you want to convert it to use
utf8
so that you can store values from many
languages. The following statement accomplishes this:
ALTER TABLE t MODIFY col1 CHAR(50) CHARACTER SET utf8;
Conversion may be lossy if the column contains characters that are not in both character sets.
A special case occurs if you have old tables from before MySQL
4.1 where a nonbinary column contains values that actually are
encoded in a character set different from the server's default
character set. For example, an application might have stored
sjis
values in a column, even though MySQL's
default character set was latin1
. It is
possible to convert the column to use the proper character set
but an additional step is required. Suppose that the server's
default character set was latin1
and
col1
is defined as
CHAR(50)
but its contents are
sjis
values. The first step is to convert the
column to a binary data type, which removes the existing
character set information without performing any character
conversion:
ALTER TABLE t MODIFY col1 BLOB;
The next step is to convert the column to a nonbinary data type with the proper character set:
ALTER TABLE t MODIFY col1 CHAR(50) CHARACTER SET sjis;
This procedure requires that the table not have been modified
already with statements such as
INSERT
or
UPDATE
after an upgrade to MySQL
4.1 or later. In that case, MySQL would store new values in the
column using latin1
, and the column will
contain a mix of sjis
and
latin1
values and cannot be converted
properly.
If you specified attributes when creating a column initially,
you should also specify them when altering the table with
ALTER TABLE
. For example, if you
specified NOT NULL
and an explicit
DEFAULT
value, you should also provide them
in the ALTER TABLE
statement.
Otherwise, the resulting column definition will not include
those attributes.
MySQL supports 70+ collations for 30+ character sets. This section indicates which character sets MySQL supports. There is one subsection for each group of related character sets. For each character set, the permissible collations are listed.
You can always list the available character sets and their
default collations with the SHOW CHARACTER
SET
statement:
mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+
| Charset | Описание | Default collation |
+----------+-----------------------------+---------------------+
| big5 | Big5 Traditional Chinese | big5_chinese_ci |
| dec8 | DEC West European | dec8_swedish_ci |
| cp850 | DOS West European | cp850_general_ci |
| hp8 | HP West European | hp8_english_ci |
| koi8r | KOI8-R Relcom Russian | koi8r_general_ci |
| latin1 | cp1252 West European | latin1_swedish_ci |
| latin2 | ISO 8859-2 Central European | latin2_general_ci |
| swe7 | 7bit Swedish | swe7_swedish_ci |
| ascii | US ASCII | ascii_general_ci |
| ujis | EUC-JP Japanese | ujis_japanese_ci |
| sjis | Shift-JIS Japanese | sjis_japanese_ci |
| hebrew | ISO 8859-8 Hebrew | hebrew_general_ci |
| tis620 | TIS620 Thai | tis620_thai_ci |
| euckr | EUC-KR Korean | euckr_korean_ci |
| koi8u | KOI8-U Ukrainian | koi8u_general_ci |
| gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci |
| greek | ISO 8859-7 Greek | greek_general_ci |
| cp1250 | Windows Central European | cp1250_general_ci |
| gbk | GBK Simplified Chinese | gbk_chinese_ci |
| latin5 | ISO 8859-9 Turkish | latin5_turkish_ci |
| armscii8 | ARMSCII-8 Armenian | armscii8_general_ci |
| utf8 | UTF-8 Unicode | utf8_general_ci |
| ucs2 | UCS-2 Unicode | ucs2_general_ci |
| cp866 | DOS Russian | cp866_general_ci |
| keybcs2 | DOS Kamenicky Czech-Slovak | keybcs2_general_ci |
| macce | Mac Central European | macce_general_ci |
| macroman | Mac West European | macroman_general_ci |
| cp852 | DOS Central European | cp852_general_ci |
| latin7 | ISO 8859-13 Baltic | latin7_general_ci |
| utf8mb4 | UTF-8 Unicode | utf8mb4_general_ci |
| cp1251 | Windows Cyrillic | cp1251_general_ci |
| utf16 | UTF-16 Unicode | utf16_general_ci |
| cp1256 | Windows Arabic | cp1256_general_ci |
| cp1257 | Windows Baltic | cp1257_general_ci |
| utf32 | UTF-32 Unicode | utf32_general_ci |
| binary | Binary pseudo charset | binary |
| geostd8 | GEOSTD8 Georgian | geostd8_general_ci |
| cp932 | SJIS for Windows Japanese | cp932_japanese_ci |
| eucjpms | UJIS for Windows Japanese | eucjpms_japanese_ci |
+----------+-----------------------------+---------------------+
The utf8mb4
, utf16
, and
utf32
character sets were added in MySQL
5.5.3.
In cases where a character set has multiple collations, it might not be clear which collation is most suitable for a given application. To avoid choosing the wrong collation, it can be helpful to perform some comparisons with representative data values to make sure that a given collation sorts values the way you expect.
Collation-Charts.Org is a useful site for information that shows how one collation compares to another.
MySQL 5.5 supports these Unicode character sets:
ucs2
, the UCS-2 encoding of the Unicode character set using 16 bits per character.utf16
, the UTF-16 encoding for the Unicode character set; likeucs2
but with an extension for supplementary characters.utf32
, the UTF-32 encoding for the Unicode character set using 32 bits per character.utf8
, a UTF-8 encoding of the Unicode character set using one to three bytes per character.utf8mb4
, a UTF-8 encoding of the Unicode character set using one to four bytes per character.
ucs2
and utf8
support
Basic Multilingual Plane (BMP) characters.
utf8mb4
, utf16
, and
utf32
support BMP and supplementary
characters. The utf8mb4
,
utf16
, and utf32
character sets were added in MySQL 5.5.3.
You can store text in about 650 languages using these character sets. This section lists the collations available for each Unicode character set and describes their differentiating properties. For general information about the character sets, see Section 9.1.10, “Unicode Support”.
A similar set of collations is available for each Unicode
character set. These are shown in the following list, where
xxx
represents the character set
name. For example,
represents the Danish collations, the specific names of which
are xxx
_danish_ciucs2_danish_ci
,
utf16_danish_ci
,
utf32_danish_ci
,
utf8_danish_ci
, and
utf8mb4_danish_ci
.
xxx
_binxxx
_czech_cixxx
_danish_cixxx
_esperanto_cixxx
_estonian_ci
(default)xxx
_general_cixxx
_hungarian_cixxx
_icelandic_cixxx
_latvian_cixxx
_lithuanian_cixxx
_persian_cixxx
_polish_cixxx
_roman_cixxx
_romanian_cixxx
_sinhala_cixxx
_slovak_cixxx
_slovenian_cixxx
_spanish_cixxx
_spanish2_cixxx
_swedish_cixxx
_turkish_cixxx
_unicode_ci
MySQL implements the
collations according to the Unicode Collation Algorithm (UCA)
described at
http://www.unicode.org/reports/tr10/. The
collation uses the version-4.0.0 UCA weight keys:
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
Currently, the
xxx
_unicode_ci
collations have only partial support for the Unicode Collation
Algorithm. Some characters are not supported yet. Also,
combining marks are not fully supported. This affects
primarily Vietnamese, Yoruba, and some smaller languages such
as Navajo.
xxx
_unicode_ci
MySQL implements language-specific Unicode collations only if
the ordering with
does not work well for a language. Language-specific
collations are UCA-based. They are derived from
xxx
_unicode_ci
with additional language tailoring rules.
xxx
_unicode_ci
For any Unicode character set, operations performed using the
collation are faster than those for the
xxx
_general_ci
collation. For example, comparisons for the
xxx
_unicode_ciutf8_general_ci
collation are faster, but
slightly less correct, than comparisons for
utf8_unicode_ci
. The reason for this is
that utf8_unicode_ci
supports mappings such
as expansions; that is, when one character compares as equal
to combinations of other characters. For example, in German
and some other languages “ß
”
is equal to “ss
”.
utf8_unicode_ci
also supports contractions
and ignorable characters. utf8_general_ci
is a legacy collation that does not support expansions,
contractions, or ignorable characters. It can make only
one-to-one comparisons between characters.
To further illustrate, the following equalities hold in both
utf8_general_ci
and
utf8_unicode_ci
(for the effect this has in
comparisons or when doing searches, see
Section 9.1.7.8, “Examples of the Effect of Collation”):
Ä = A Ö = O Ü = U
A difference between the collations is that this is true for
utf8_general_ci
:
ß = s
Whereas this is true for utf8_unicode_ci
,
which supports the German DIN-1 ordering (also known as
dictionary order):
ß = ss
MySQL implements language-specific collations for the
utf8
character set only if the ordering
with utf8_unicode_ci
does not work well for
a language. For example, utf8_unicode_ci
works fine for German dictionary order and French, so there is
no need to create special utf8
collations.
utf8_general_ci
also is satisfactory for
both German and French, except that
“ß
” is equal to
“s
”, and not to
“ss
”. If this is acceptable
for your application, you should use
utf8_general_ci
because it is faster.
Otherwise, use utf8_unicode_ci
because it
is more accurate.
includes Swedish rules. For example, in Swedish, the following
relationship holds, which is not something expected by a
German or French speaker:
xxx
_swedish_ci
Ü = Y < Ö
The
and
xxx
_spanish_ci
collations correspond to modern Spanish and traditional
Spanish, respectively. In both collations,
“xxx
_spanish2_ciñ
” (n-tilde) is a separate
letter between “n
” and
“o
”. In addition, for
traditional Spanish, “ch
” is a
separate letter between “c
”
and “d
”, and
“ll
” is a separate letter
between “l
” and
“m
”
In the
collations, xxx
_roman_ciI
and J
compare as equal, and U
and
V
compare as equal.
For all Unicode collations except the “binary”
(
)
collations, MySQL performs a table lookup to find a
character's collating weight. If a character is not in the
table (for example, because it is a “new”
character), collating weight determination becomes more
complex:
xxx
_bin
For BMP characters in general collations (
), weight = code point.xxx
_general_ciFor BMP characters in UCA collations (for example,
and language-specific collations), the following algorithm applies:xxx
_unicode_ciif (code >= 0x3400 && code <= 0x4DB5) base= 0xFB80; /* CJK Ideograph Extension */ else if (code >= 0x4E00 && code <= 0x9FA5) base= 0xFB40; /* CJK Ideograph */ else base= 0xFBC0; /* All other characters */ aaaa= base + (code >> 15); bbbb= (code & 0x7FFF) | 0x8000;
The result is a sequence of two collating elements,
aaaa
followed bybbbb
.Thus,
U+04cf CYRILLIC SMALL LETTER PALOCHKA
currently is, with all UCA collations, greater thanU+04c0 CYRILLIC LETTER PALOCHKA
. Eventually, after further collation tuning, all palochkas will sort together.For supplementary characters in general collations, the weight is the weight for
0xfffd REPLACEMENT CHARACTER
. For supplementary characters in UCA collations, their collating weight is0xfffd
. That is, to MySQL, all supplementary characters are equal to each other, and greater than almost all BMP characters.An example with Deseret characters and
COUNT(DISTINCT)
:CREATE TABLE t (s1 VARCHAR(5) CHARACTER SET utf32 COLLATE utf32_unicode_ci); INSERT INTO t VALUES (0xfffd); /* REPLACEMENT CHARACTER */ INSERT INTO t VALUES (0x010412); /* DESERET CAPITAL LETTER BEE */ INSERT INTO t VALUES (0x010413); /* DESERET CAPITAL LETTER TEE */ SELECT COUNT(DISTINCT s1) FROM t;
The result is 2 because in the MySQL
collations, the replacement character has a weight ofxxx
_unicode_ci0x0dc6
, whereas Deseret Bee and Deseret Tee both have a weight of0xfffd
. (Were theutf32_general_ci
collation used instead, the result would be 1 because all three characters have a weight of0xfffd
in that collation.)
The current rule that all supplementary characters are equal to each other is nonoptimal but is not expected to cause trouble. These characters are very rare, so it will be very rare that a multi-character string consists entirely of supplementary characters. In Japan, since the supplementary characters are obscure Kanji ideographs, the typical user does not care what order they are in, anyway. If you really want rows sorted by MySQL's rule and secondarily by code point value, it is easy:
ORDER BY s1 COLLATE utf32_unicode_ci, s1 COLLATE utf32_bin
The utf16_bin
Collation
There is a difference between “ordering by the
character's code value” and “ordering by the
character's binary representation,” a difference that
appears only with utf16_bin
, because of
surrogates.
Suppose that utf16_bin
(the binary
collation for utf16
) was a binary
comparison “byte by byte” rather than
“character by character.” If that were so, the
order of characters in utf16_bin
would
differ from the order in utf8_bin
. For
example, the following chart shows two rare characters. The
first character is in the range
E000
-FFFF
, so it is
greater than a surrogate but less than a supplementary. The
second character is a supplementary.
Code point Character utf8 utf16 ---------- --------- ---- ----- 0FF9D HALFWIDTH KATAKANA LETTER N EF BE 9D FF 9D 10384 UGARITIC LETTER DELTA F0 90 8E 84 D8 00 DF 84
The two characters in the chart are in order by code point
value because 0xff9d
<
0x10384
. And they are in order by
utf8
value because 0xef
< 0xf0
. But they are not in order by
utf16
value, if we use byte-by-byte
comparison, because 0xff
>
0xd8
.
So MySQL's utf16_bin
collation is not
“byte by byte.” It is “by code
point.” When MySQL sees a supplementary-character
encoding in utf16
, it converts to the
character's code-point value, and then compares. Therefore,
utf8_bin
and utf16_bin
are the same ordering. This is consistent with the SQL:2008
standard requirement for a UCS_BASIC collation:
“UCS_BASIC is a collation in which the ordering is
determined entirely by the Unicode scalar values of the
characters in the strings being sorted. It is applicable to
the UCS character repertoire. Since every character repertoire
is a subset of the UCS repertoire, the UCS_BASIC collation is
potentially applicable to every character set. NOTE 11: The
Unicode scalar value of a character is its code point treated
as an unsigned integer.”
If the character set is ucs2
, comparison is
byte-by-byte, but ucs2
strings should not
contain surrogates, anyway.
For additional information about Unicode collations in MySQL, see Collation-Charts.Org (utf8).
Western European character sets cover most West European languages, such as French, Spanish, Catalan, Basque, Portuguese, Italian, Albanian, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English.
ascii
(US ASCII) collations:ascii_bin
ascii_general_ci
(default)
cp850
(DOS West European) collations:cp850_bin
cp850_general_ci
(default)
dec8
(DEC Western European) collations:dec8_bin
dec8_swedish_ci
(default)
hp8
(HP Western European) collations:hp8_bin
hp8_english_ci
(default)
latin1
(cp1252 West European) collations:latin1_bin
latin1_danish_ci
latin1_general_ci
latin1_general_cs
latin1_german1_ci
latin1_german2_ci
latin1_spanish_ci
latin1_swedish_ci
(default)
latin1
is the default character set. MySQL'slatin1
is the same as the Windowscp1252
character set. This means it is the same as the officialISO 8859-1
or IANA (Internet Assigned Numbers Authority)latin1
, except that IANAlatin1
treats the code points between0x80
and0x9f
as “undefined,” whereascp1252
, and therefore MySQL'slatin1
, assign characters for those positions. For example,0x80
is the Euro sign. For the “undefined” entries incp1252
, MySQL translates0x81
to Unicode0x0081
,0x8d
to0x008d
,0x8f
to0x008f
,0x90
to0x0090
, and0x9d
to0x009d
.The
latin1_swedish_ci
collation is the default that probably is used by the majority of MySQL customers. Although it is frequently said that it is based on the Swedish/Finnish collation rules, there are Swedes and Finns who disagree with this statement.The
latin1_german1_ci
andlatin1_german2_ci
collations are based on the DIN-1 and DIN-2 standards, where DIN stands for Deutsches Institut für Normung (the German equivalent of ANSI). DIN-1 is called the “dictionary collation” and DIN-2 is called the “phone book collation.” For an example of the effect this has in comparisons or when doing searches, see Section 9.1.7.8, “Examples of the Effect of Collation”.latin1_german1_ci
(dictionary) rules:Ä = A Ö = O Ü = U ß = s
latin1_german2_ci
(phone-book) rules:Ä = AE Ö = OE Ü = UE ß = ss
In the
latin1_spanish_ci
collation, “ñ
” (n-tilde) is a separate letter between “n
” and “o
”.macroman
(Mac West European) collations:macroman_bin
macroman_general_ci
(default)
swe7
(7bit Swedish) collations:swe7_bin
swe7_swedish_ci
(default)
For additional information about Western European collations in MySQL, see Collation-Charts.Org (ascii, cp850, dec8, hp8, latin1, macroman, swe7).
MySQL provides some support for character sets used in the Czech Republic, Slovakia, Hungary, Romania, Slovenia, Croatia, Poland, and Serbia (Latin).
cp1250
(Windows Central European) collations:cp1250_bin
cp1250_croatian_ci
cp1250_czech_cs
cp1250_general_ci
(default)cp1250_polish_ci
cp852
(DOS Central European) collations:cp852_bin
cp852_general_ci
(default)
keybcs2
(DOS Kamenicky Czech-Slovak) collations:keybcs2_bin
keybcs2_general_ci
(default)
latin2
(ISO 8859-2 Central European) collations:latin2_bin
latin2_croatian_ci
latin2_czech_cs
latin2_general_ci
(default)latin2_hungarian_ci
macce
(Mac Central European) collations:macce_bin
macce_general_ci
(default)
For additional information about Central European collations in MySQL, see Collation-Charts.Org (cp1250, cp852, keybcs2, latin2, macce).
South European and Middle Eastern character sets supported by MySQL include Armenian, Arabic, Georgian, Greek, Hebrew, and Turkish.
armscii8
(ARMSCII-8 Armenian) collations:armscii8_bin
armscii8_general_ci
(default)
cp1256
(Windows Arabic) collations:cp1256_bin
cp1256_general_ci
(default)
geostd8
(GEOSTD8 Georgian) collations:geostd8_bin
geostd8_general_ci
(default)
greek
(ISO 8859-7 Greek) collations:greek_bin
greek_general_ci
(default)
hebrew
(ISO 8859-8 Hebrew) collations:hebrew_bin
hebrew_general_ci
(default)
latin5
(ISO 8859-9 Turkish) collations:latin5_bin
latin5_turkish_ci
(default)
For additional information about South European and Middle Eastern collations in MySQL, see Collation-Charts.Org (armscii8, cp1256, geostd8, greek, hebrew, latin5).
The Baltic character sets cover Estonian, Latvian, and Lithuanian languages.
cp1257
(Windows Baltic) collations:cp1257_bin
cp1257_general_ci
(default)cp1257_lithuanian_ci
latin7
(ISO 8859-13 Baltic) collations:latin7_bin
latin7_estonian_cs
latin7_general_ci
(default)latin7_general_cs
For additional information about Baltic collations in MySQL, see Collation-Charts.Org (cp1257, latin7).
The Cyrillic character sets and collations are for use with Belarusian, Bulgarian, Russian, Ukrainian, and Serbian (Cyrillic) languages.
cp1251
(Windows Cyrillic) collations:cp1251_bin
cp1251_bulgarian_ci
cp1251_general_ci
(default)cp1251_general_cs
cp1251_ukrainian_ci
cp866
(DOS Russian) collations:cp866_bin
cp866_general_ci
(default)
koi8r
(KOI8-R Relcom Russian) collations:koi8r_bin
koi8r_general_ci
(default)
koi8u
(KOI8-U Ukrainian) collations:koi8u_bin
koi8u_general_ci
(default)
For additional information about Cyrillic collations in MySQL, see Collation-Charts.Org (cp1251, cp866, koi8r, koi8u). ).
The Asian character sets that we support include Chinese,
Japanese, Korean, and Thai. These can be complicated. For
example, the Chinese sets must allow for thousands of
different characters. See Section 9.1.14.7.1, “The cp932
Character Set”, for
additional information about the cp932
and
sjis
character sets.
For answers to some common questions and problems relating support for Asian character sets in MySQL, see Section B.11, “MySQL 5.5 FAQ: MySQL Chinese, Japanese, and Korean Character Sets”.
big5
(Big5 Traditional Chinese) collations:big5_bin
big5_chinese_ci
(default)
cp932
(SJIS for Windows Japanese) collations:cp932_bin
cp932_japanese_ci
(default)
eucjpms
(UJIS for Windows Japanese) collations:eucjpms_bin
eucjpms_japanese_ci
(default)
euckr
(EUC-KR Korean) collations:euckr_bin
euckr_korean_ci
(default)
gb2312
(GB2312 Simplified Chinese) collations:gb2312_bin
gb2312_chinese_ci
(default)
gbk
(GBK Simplified Chinese) collations:gbk_bin
gbk_chinese_ci
(default)
sjis
(Shift-JIS Japanese) collations:sjis_bin
sjis_japanese_ci
(default)
tis620
(TIS620 Thai) collations:tis620_bin
tis620_thai_ci
(default)
ujis
(EUC-JP Japanese) collations:ujis_bin
ujis_japanese_ci
(default)
The big5_chinese_ci
collation sorts on
number of strokes.
For additional information about Asian collations in MySQL, see Collation-Charts.Org (big5, cp932, eucjpms, euckr, gb2312, gbk, sjis, tis620, ujis).
Why is cp932
needed?
In MySQL, the sjis
character set
corresponds to the Shift_JIS
character
set defined by IANA, which supports JIS X0201 and JIS X0208
characters. (See
http://www.iana.org/assignments/character-sets.)
However, the meaning of “SHIFT JIS” as a
descriptive term has become very vague and it often includes
the extensions to Shift_JIS
that are
defined by various vendors.
For example, “SHIFT JIS” used in Japanese
Windows environments is a Microsoft extension of
Shift_JIS
and its exact name is
Microsoft Windows Codepage : 932
or
cp932
. In addition to the characters
supported by Shift_JIS
,
cp932
supports extension characters such
as NEC special characters, NEC selected—IBM extended
characters, and IBM selected characters.
Many Japanese users have experienced problems using these extension characters. These problems stem from the following factors:
MySQL automatically converts character sets.
Character sets are converted using Unicode (
ucs2
).The
sjis
character set does not support the conversion of these extension characters.There are several conversion rules from so-called “SHIFT JIS” to Unicode, and some characters are converted to Unicode differently depending on the conversion rule. MySQL supports only one of these rules (described later).
The MySQL cp932
character set is designed
to solve these problems.
Because MySQL supports character set conversion, it is
important to separate IANA Shift_JIS
and
cp932
into two different character sets
because they provide different conversion rules.
How does cp932
differ from sjis
?
The cp932
character set differs from
sjis
in the following ways:
cp932
supports NEC special characters, NEC selected—IBM extended characters, and IBM selected characters.Some
cp932
characters have two different code points, both of which convert to the same Unicode code point. When converting from Unicode back tocp932
, one of the code points must be selected. For this “round trip conversion,” the rule recommended by Microsoft is used. (See http://support.microsoft.com/kb/170559/EN-US/.)The conversion rule works like this:
If the character is in both JIS X 0208 and NEC special characters, use the code point of JIS X 0208.
If the character is in both NEC special characters and IBM selected characters, use the code point of NEC special characters.
If the character is in both IBM selected characters and NEC selected—IBM extended characters, use the code point of IBM extended characters.
The table shown at http://www.microsoft.com/globaldev/reference/dbcs/932.htm provides information about the Unicode values of
cp932
characters. Forcp932
table entries with characters under which a four-digit number appears, the number represents the corresponding Unicode (ucs2
) encoding. For table entries with an underlined two-digit value appears, there is a range ofcp932
character values that begin with those two digits. Clicking such a table entry takes you to a page that displays the Unicode value for each of thecp932
characters that begin with those digits.The following links are of special interest. They correspond to the encodings for the following sets of characters:
NEC special characters:
http://www.microsoft.com/globaldev/reference/dbcs/932/932_87.htm
NEC selected—IBM extended characters:
http://www.microsoft.com/globaldev/reference/dbcs/932/932_ED.htm http://www.microsoft.com/globaldev/reference/dbcs/932/932_EE.htm
IBM selected characters:
http://www.microsoft.com/globaldev/reference/dbcs/932/932_FA.htm http://www.microsoft.com/globaldev/reference/dbcs/932/932_FB.htm http://www.microsoft.com/globaldev/reference/dbcs/932/932_FC.htm
cp932
supports conversion of user-defined characters in combination witheucjpms
, and solves the problems withsjis
/ujis
conversion. For details, please refer to http://www.opengroup.or.jp/jvc/cde/sjis-euc-e.html.
For some characters, conversion to and from
ucs2
is different for
sjis
and cp932
. The
following tables illustrate these differences.
Conversion to ucs2
:
sjis /cp932 Value | sjis -> ucs2 Conversion | cp932 -> ucs2 Conversion |
---|---|---|
5C | 005C | 005C |
7E | 007E | 007E |
815C | 2015 | 2015 |
815F | 005C | FF3C |
8160 | 301C | FF5E |
8161 | 2016 | 2225 |
817C | 2212 | FF0D |
8191 | 00A2 | FFE0 |
8192 | 00A3 | FFE1 |
81CA | 00AC | FFE2 |
Conversion from ucs2
:
ucs2 value | ucs2 -> sjis Conversion | ucs2 -> cp932 Conversion |
---|---|---|
005C | 815F | 5C |
007E | 7E | 7E |
00A2 | 8191 | 3F |
00A3 | 8192 | 3F |
00AC | 81CA | 3F |
2015 | 815C | 815C |
2016 | 8161 | 3F |
2212 | 817C | 3F |
2225 | 3F | 8161 |
301C | 8160 | 3F |
FF0D | 3F | 817C |
FF3C | 3F | 815F |
FF5E | 3F | 8160 |
FFE0 | 3F | 8191 |
FFE1 | 3F | 8192 |
FFE2 | 3F | 81CA |
Users of any Japanese character sets should be aware that
using
--character-set-client-handshake
(or
--skip-character-set-client-handshake
)
has an important effect. See
Section 5.1.2, “Server Command Options”.
By default, mysqld produces error messages in English, but they can also be displayed in any of several other languages: Czech, Danish, Dutch, Estonian, French, German, Greek, Hungarian, Italian, Japanese, Korean, Norwegian, Norwegian-ny, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, or Swedish.
You can select which language the server uses for error messages using the instructions in this section.
As of MySQL 5.5, the server searches for the error message file in two locations:
It tries to find the file in a directory constructed from two system variable values,
lc_messages_dir
andlc_messages
, with the latter converted to a language name. Suppose that you start the server using this command:shell>
mysqld --lc_messages_dir=/usr/share/mysql --lc_messages=fr_FR
In this case, mysqld maps the locale
fr_FR
to the languagefrench
and looks for the error file in the/usr/share/mysql/french
directory.If the message file cannot be found in the directory constructed as just described, the server ignores the
lc_messages
value and uses only thelc_messages_dir
value as the location in which to look.
The lc_messages_dir
system
variable has only a global value and is read only.
lc_messages
has global and
session values and can be modified at runtime, so the error
message language can be changed while the server is running, and
individual clients each can have a different error message
language by changing their session
lc_messages
value to a different
locale name. For example, if the server is using the
fr_FR
locale for error messages, a client can
execute this statement to receive error messages in English:
mysql> SET lc_messages = 'en_US';
Before MySQL 5.5, the
lc_messages_dir
and
lc_messages
system variables were
unavailable. To start mysqld with a particular
language for error messages, the
--language
or -L
option were used. The option value can be a language name or the
full path to the error message file. For example:
shell> mysqld --language=french
Or:
shell> mysqld --language=/usr/local/share/french
Specify the language name in lowercase.
MySQL 5.5 treats
--language
as an alias for
lc_messages_dir
.
By default, the language files are located in the
share/mysql/
directory under the MySQL base directory.
LANGUAGE
For information about changing the character set for error messages (rather than the language), see Section 9.1.6, “Character Set for Error Messages”.
You can change the content of the error messages produced by the server using the instructions in the MySQL Internals manual, available at http://forge.mysql.com/wiki/MySQL_Internals_Error_Messages. If you do change the content of error messages, remember to repeat your changes after each upgrade to a newer version of MySQL.
This section discusses the procedure for adding a character set to MySQL. The proper procedure depends on whether the character set is simple or complex:
If the character set does not need special string collating routines for sorting and does not need multi-byte character support, it is simple.
If the character set needs either of those features, it is complex.
For example, greek
and swe7
are simple character sets, whereas big5
and
czech
are complex character sets.
To use the following instructions, you must have a MySQL source
distribution. In the instructions,
MYSET
represents the name of the
character set that you want to add.
Add a
<charset>
element forMYSET
to thesql/share/charsets/Index.xml
file. Use the existing contents in the file as a guide to adding new contents. A partial listing for thelatin1
<charset>
element follows:<charset name="latin1"> <family>Western</family> <description>cp1252 West European</description> ... <collation name="latin1_swedish_ci" id="8" order="Finnish, Swedish"> <flag>primary</flag> <flag>compiled</flag> </collation> <collation name="latin1_danish_ci" id="15" order="Danish"/> ... <collation name="latin1_bin" id="47" order="Binary"> <flag>binary</flag> <flag>compiled</flag> </collation> ... </charset>
The
<charset>
element must list all the collations for the character set. These must include at least a binary collation and a default (primary) collation. The default collation is often named using a suffix ofgeneral_ci
(general, case insensitive). It is possible for the binary collation to be the default collation, but usually they are different. The default collation should have aprimary
flag. The binary collation should have abinary
flag.You must assign a unique ID number to each collation. The range of IDs from 1024 to 2047 is reserved for user-defined collations. Before MySQL 5.5, the ID must be chosen from the range 1 to 254. To find the maximum of the currently used collation IDs, use this query:
SELECT MAX(ID) FROM INFORMATION_SCHEMA.COLLATIONS;
This step depends on whether you are adding a simple or complex character set. A simple character set requires only a configuration file, whereas a complex character set requires C source file that defines collation functions, multi-byte functions, or both.
For a simple character set, create a configuration file,
, that describes the character set properties. Create this file in theMYSET
.xmlsql/share/charsets
directory. You can use a copy oflatin1.xml
as the basis for this file. The syntax for the file is very simple:Comments are written as ordinary XML comments (
<!--
).text
-->Words within
<map>
array elements are separated by arbitrary amounts of whitespace.Each word within
<map>
array elements must be a number in hexadecimal format.The
<map>
array element for the<ctype>
element has 257 words. The other<map>
array elements after that have 256 words. See Section 9.3.1, “Character Definition Arrays”.For each collation listed in the
<charset>
element for the character set inIndex.xml
,
must contain aMYSET
.xml<collation>
element that defines the character ordering.
For a complex character set, create a C source file that describes the character set properties and defines the support routines necessary to properly perform operations on the character set:
Create the file
ctype-
in theMYSET
.cstrings
directory. Look at one of the existingctype-*.c
files (such asctype-big5.c
) to see what needs to be defined. The arrays in your file must have names likectype_
,MYSET
to_lower_
, and so on. These correspond to the arrays for a simple character set. See Section 9.3.1, “Character Definition Arrays”.MYSET
For each
<collation>
element listed in the<charset>
element for the character set inIndex.xml
, thectype-
file must provide an implementation of the collation.MYSET
.cIf the character set requires string collating functions, see Section 9.3.2, “String Collating Support for Complex Character Sets”.
If the character set requires multi-byte character support, see Section 9.3.3, “Multi-Byte Character Support for Complex Character Sets”.
Modify the configuration information. Use the existing configuration information as a guide to adding information for
MYSYS
. The example here assumes that the character set has default and binary collations, but more lines are needed ifMYSET
has additional collations.Edit
mysys/charset-def.c
, and “register” the collations for the new character set.Add these lines to the “declaration” section:
#ifdef HAVE_CHARSET_
MYSET
extern CHARSET_INFO my_charset_MYSET
_general_ci; extern CHARSET_INFO my_charset_MYSET
_bin; #endifAdd these lines to the “registration” section:
#ifdef HAVE_CHARSET_
MYSET
add_compiled_collation(&my_charset_MYSET
_general_ci); add_compiled_collation(&my_charset_MYSET
_bin); #endifIf the character set uses
ctype-
, editMYSET
.cstrings/CMakeLists.txt
and addctype-
to the definition of theMYSET
.cSTRINGS_SOURCES
variable.Edit
cmake/character_sets.cmake
:Add
MYSET
to the value of withCHARSETS_AVAILABLE
in alphabetic order.Add
MYSET
to the value ofCHARSETS_COMPLEX
in alphabetic order. This is needed even for simple character sets, or CMake will not recognize-DDEFAULT_CHARSET=
.MYSET
Reconfigure, recompile, and test.
Each simple character set has a configuration file located in
the sql/share/charsets
directory. For a
character set named MYSYS
, the file
is named
. It
uses MYSET
.xml<map>
array elements to list
character set properties. <map>
elements appear within these elements:
<ctype>
defines attributes for each character.<lower>
and<upper>
list the lowercase and uppercase characters.<unicode>
maps 8-bit character values to Unicode values.<collation>
elements indicate character ordering for comparisons and sorts, one element per collation. Binary collations need no<map>
element because the character codes themselves provide the ordering.
For a complex character set as implemented in a
ctype-
file in the MYSET
.cstrings
directory, there are
corresponding arrays:
ctype_
,
MYSET
[]to_lower_
,
and so forth. Not every complex character set has all of the
arrays. See also the existing MYSET
[]ctype-*.c
files for examples. See the
CHARSET_INFO.txt
file in the
strings
directory for additional
information.
Most of the arrays are indexed by character value and have 256
elements. The <ctype>
array is indexed
by character value + 1 and has 257 elements. This is a legacy
convention for handling EOF
.
<ctype>
array elements are bit values.
Each element describes the attributes of a single character in
the character set. Each attribute is associated with a bitmask,
as defined in include/m_ctype.h
:
#define _MY_U 01 /* Upper case */ #define _MY_L 02 /* Lower case */ #define _MY_NMR 04 /* Numeral (digit) */ #define _MY_SPC 010 /* Spacing character */ #define _MY_PNT 020 /* Punctuation */ #define _MY_CTR 040 /* Control character */ #define _MY_B 0100 /* Blank */ #define _MY_X 0200 /* heXadecimal digit */
The <ctype>
value for a given character
should be the union of the applicable bitmask values that
describe the character. For example, 'A'
is
an uppercase character (_MY_U
) as well as a
hexadecimal digit (_MY_X
), so its
ctype
value should be defined like this:
ctype['A'+1] = _MY_U | _MY_X = 01 | 0200 = 0201
The bitmask values in m_ctype.h
are octal
values, but the elements of the <ctype>
array in
should
be written as hexadecimal values.
MYSET
.xml
The <lower>
and
<upper>
arrays hold the lowercase and
uppercase characters corresponding to each member of the
character set. For example:
lower['A'] should contain 'a' upper['a'] should contain 'A'
Each <collation>
array indicates how
characters should be ordered for comparison and sorting
purposes. MySQL sorts characters based on the values of this
information. In some cases, this is the same as the
<upper>
array, which means that sorting
is case-insensitive. For more complicated sorting rules (for
complex character sets), see the discussion of string collating
in Section 9.3.2, “String Collating Support for Complex Character Sets”.
For a simple character set named
MYSET
, sorting rules are specified in
the
configuration file using MYSET
.xml<map>
array
elements within <collation>
elements.
If the sorting rules for your language are too complex to be
handled with simple arrays, you must define string collating
functions in the
ctype-
source file in the MYSET
.cstrings
directory.
The existing character sets provide the best documentation and
examples to show how these functions are implemented. Look at
the ctype-*.c
files in the
strings
directory, such as the files for
the big5
, czech
,
gbk
, sjis
, and
tis160
character sets. Take a look at the
MY_COLLATION_HANDLER
structures to see how
they are used. See also the
CHARSET_INFO.txt
file in the
strings
directory for additional
information.
If you want to add support for a new character set named
MYSET
that includes multi-byte
characters, you must use multi-byte character functions in the
ctype-
source file in the MYSET
.cstrings
directory.
The existing character sets provide the best documentation and
examples to show how these functions are implemented. Look at
the ctype-*.c
files in the
strings
directory, such as the files for
the euc_kr
, gb2312
,
gbk
, sjis
, and
ujis
character sets. Take a look at the
MY_CHARSET_HANDLER
structures to see how they
are used. See also the CHARSET_INFO.txt
file in the strings
directory for
additional information.
A collation is a set of rules that defines how to compare and sort character strings. Each collation in MySQL belongs to a single character set. Every character set has at least one collation, and most have two or more collations.
A collation orders characters based on weights. Each character in a character set maps to a weight. Characters with equal weights compare as equal, and characters with unequal weights compare according to the relative magnitude of their weights.
MySQL supports several collation implementations, as discussed in Section 9.4.1, “Collation Implementation Types”. Some of these can be added to MySQL without recompiling:
Simple collations for 8-bit character sets.
UCA-based collations for Unicode character sets.
Binary (
) collations.xxx
_bin
The following sections describe how to add collations of the first two types to existing character sets. All existing character sets already have a binary collation, so there is no need here to describe how to add one.
Summary of the procedure for adding a new collation:
Choose a collation ID.
Add configuration information that names the collation and describes the character-ordering rules.
Restart the server.
Verify that the collation is present.
The instructions here cover only collations that can be added without recompiling MySQL. To add a collation that does require recompiling (as implemented by means of functions in a C source file), use the instructions in Section 9.3, “Adding a Character Set”. However, instead of adding all the information required for a complete character set, just modify the appropriate files for an existing character set. That is, based on what is already present for the character set's current collations, add data structures, functions, and configuration information for the new collation.
If you modify an existing collation, that may affect the ordering of rows for indexes on columns that use the collation. In this case, rebuild any such indexes to avoid problems such as incorrect query results. For further information, see Section 2.11.3, “Checking Whether Tables or Indexes Must Be Rebuilt”.
Additional Resources
The Unicode Collation Algorithm (UCA) specification: http://www.unicode.org/reports/tr10/
The Locale Data Markup Language (LDML) specification: http://www.unicode.org/reports/tr35/
MySQL University session “How to Add a Collation”: http://forge.mysql.com/wiki/How_to_Add_a_Collation
MySQL Blog article “Instructions for adding a new Unicode collation”: http://blogs.mysql.com/peterg/2008/05/19/instructions-for-adding-a-new-unicode-collation/
MySQL implements several types of collations:
Simple collations for 8-bit character sets
This kind of collation is implemented using an array of 256
weights that defines a one-to-one mapping from character codes
to weights. latin1_swedish_ci
is an example.
It is a case-insensitive collation, so the uppercase and
lowercase versions of a character have the same weights and they
compare as equal.
mysql>SET NAMES 'latin1' COLLATE 'latin1_swedish_ci';
Query OK, 0 rows affected (0.00 sec) mysql>SELECT 'a' = 'A';
+-----------+ | 'a' = 'A' | +-----------+ | 1 | +-----------+ 1 row in set (0.00 sec)
For implementation instructions, see Section 9.4.3, “Adding a Simple Collation to an 8-Bit Character Set”.
Complex collations for 8-bit character sets
This kind of collation is implemented using functions in a C source file that define how to order characters, as described in Section 9.3, “Adding a Character Set”.
Collations for non-Unicode multi-byte character sets
For this type of collation, 8-bit (single-byte) and multi-byte
characters are handled differently. For 8-bit characters,
character codes map to weights in case-insensitive fashion. (For
example, the single-byte characters 'a'
and
'A'
both have a weight of
0x41
.) For multi-byte characters, there are
two types of relationship between character codes and weights:
Weights equal character codes.
sjis_japanese_ci
is an example of this kind of collation. The multi-byte character'ぢ'
has a character code of0x82C0
, and the weight is also0x82C0
.Character codes map one-to-one to weights, but a code is not necessarily equal to the weight.
gbk_chinese_ci
is an example of this kind of collation. The multi-byte character'膰'
has a character code of0x81B0
but a weight of0xC286
.
For implementation instructions, see Section 9.3, “Adding a Character Set”.
Collations for Unicode multi-byte character sets
Some of these collations are based on the Unicode Collation Algorithm (UCA), others are not.
Non-UCA collations have a one-to-one mapping from character code
to weight. In MySQL, such collations are case insensitive and
accent insensitive. utf8_general_ci
is an
example: 'a'
, 'A'
,
'À'
, and 'á'
each have
different character codes but all have a weight of
0x0041
and compare as equal.
mysql>SET NAMES 'utf8' COLLATE 'utf8_general_ci';
Query OK, 0 rows affected (0.00 sec) mysql>SELECT 'a' = 'A', 'a' = 'À', 'a' = 'á';
+-----------+-----------+-----------+ | 'a' = 'A' | 'a' = 'À' | 'a' = 'á' | +-----------+-----------+-----------+ | 1 | 1 | 1 | +-----------+-----------+-----------+ 1 row in set (0.06 sec)
UCA-based collations in MySQL have these properties:
If a character has weights, each weight uses 2 bytes (16 bits).
A character may have zero weights (or an empty weight). In this case, the character is ignorable. Пример: "U+0000 NULL" does not have a weight and is ignorable.
A character may have one weight. Пример:
'a'
has a weight of0x0E33
.A character may have many weights. This is an expansion. Пример: The German letter
'ß'
(SZ ligature, or SHARP S) has a weight of0x0FEA0FEA
.Many characters may have one weight. This is a contraction. Пример:
'ch'
is a single letter in Czech and has a weight of0x0EE2
.
A many-characters-to-many-weights mapping is also possible (this is contraction with expansion), but is not supported by MySQL.
For implementation instructions, for a non-UCA colluation, see Section 9.3, “Adding a Character Set”. For a UCA collation, see Section 9.4.4, “Adding a UCA Collation to a Unicode Character Set”.
Miscellaneous collations
There are also a few collations that do not fall into any of the previous categories.
Each collation must have a unique ID. To add a collation, you must choose an ID value that is not currently used. The range of IDs from 1024 to 2047 is reserved for user-defined collations. Before MySQL 5.5, the ID must be chosen from the range 1 to 254. The collation ID that you choose will appear in these contexts:
The
ID
column of theINFORMATION_SCHEMA.COLLATIONS
table.The
Id
column ofSHOW COLLATION
output.The
charsetnr
member of theMYSQL_FIELD
C API data structure.The
number
member of theMY_CHARSET_INFO
data structure returned by themysql_get_character_set_info()
C API function.
To determine the largest currently used ID, issue the following statement:
mysql> SELECT MAX(ID) FROM INFORMATION_SCHEMA.COLLATIONS;
+---------+
| MAX(ID) |
+---------+
| 210 |
+---------+
To display a list of all currently used IDs, issue this statement:
mysql> SELECT ID FROM INFORMATION_SCHEMA.COLLATIONS ORDER BY ID;
+-----+
| ID |
+-----+
| 1 |
| 2 |
| ... |
| 52 |
| 53 |
| 57 |
| 58 |
| ... |
| 98 |
| 99 |
| 128 |
| 129 |
| ... |
| 210 |
+-----+
Before MySQL 5.5, which provides for a range of user-defined collation IDs, you must choose an ID in the range from 1 to 254. In this case, if you upgrade MySQL, you may find that the collation ID you choose has been assigned to a collation included in the new MySQL distribution. In this case, you will need to choose a new value for your own collation.
In addition, before upgrading, you should save the configuration files that you change. If you upgrade in place, the process will replace the your modified files.
This section describes how to add a simple collation for an
8-bit character set by writing the
<collation>
elements associated with a
<charset>
character set description in
the MySQL Index.xml
file. The procedure
described here does not require recompiling MySQL. The example
adds a collation named latin1_test_ci
to the
latin1
character set.
Choose a collation ID, as shown in Section 9.4.2, “Choosing a Collation ID”. The following steps use an ID of 1024.
Modify the
Index.xml
andlatin1.xml
configuration files. These files will be located in the directory named by thecharacter_sets_dir
system variable. You can check the variable value as follows, although the path name might be different on your system:mysql>
SHOW VARIABLES LIKE 'character_sets_dir';
+--------------------+-----------------------------------------+ | Variable_name | Value | +--------------------+-----------------------------------------+ | character_sets_dir | /user/local/mysql/share/mysql/charsets/ | +--------------------+-----------------------------------------+Choose a name for the collation and list it in the
Index.xml
file. Find the<charset>
element for the character set to which the collation is being added, and add a<collation>
element that indicates the collation name and ID, to associate the name with the ID. For example:<charset name="latin1"> ... <collation name="latin1_test_ci" id="1024"/> ... </charset>
In the
latin1.xml
configuration file, add a<collation>
element that names the collation and that contains a<map>
element that defines a character code-to-weight mapping table for character codes 0 to 255. Each value within the<map>
element must be a number in hexadecimal format.<collation name="latin1_test_ci"> <map> 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F 40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F 50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F 60 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F 50 51 52 53 54 55 56 57 58 59 5A 7B 7C 7D 7E 7F 80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F 90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF 41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49 44 4E 4F 4F 4F 4F 5C D7 5C 55 55 55 59 59 DE DF 41 41 41 41 5B 5D 5B 43 45 45 45 45 49 49 49 49 44 4E 4F 4F 4F 4F 5C F7 5C 55 55 55 59 59 DE FF </map> </collation>
Restart the server and use this statement to verify that the collation is present:
mysql>
SHOW COLLATION LIKE 'latin1_test_ci';
+----------------+---------+------+---------+----------+---------+ | Collation | Charset | Id | Default | Compiled | Sortlen | +----------------+---------+------+---------+----------+---------+ | latin1_test_ci | latin1 | 1024 | | | 1 | +----------------+---------+------+---------+----------+---------+
This section describes how to add a UCA collation for a Unicode
character set by writing the
<collation>
element within a
<charset>
character set description in
the MySQL Index.xml
file. The procedure
described here does not require recompiling MySQL. It uses a
subset of the Locale Data Markup Language (LDML) specification,
which is available at
http://www.unicode.org/reports/tr35/. With this
method, you need not define the entire collation. Instead, you
begin with an existing “base” collation and
describe the new collation in terms of how it differs from the
base collation. The following table lists the base collations of
the Unicode character sets for which UCA collations can be
defined.
Table 9.1. MySQL Character Sets Available for User-Defined UCA Collations
Character Set | Base Collation |
---|---|
utf8 | utf8_unicode_ci |
ucs2 | ucs2_unicode_ci |
utf16 | utf16_unicode_ci |
utf32 | utf32_unicode_ci |
The following sections show how to add a collation that is defined using LDML syntax, and provide a summary of LDML rules supported in MySQL.
To add a UCA collation for a Unicode character set without recompiling MySQL, use the following procedure. If you are unfamiliar with the LDML rules used to describe the collation's sort characteristics, see Section 9.4.4.2, “LDML Синтаксис Supported in MySQL”.
The example adds a collation named
utf8_phone_ci
to the
utf8
character set. The collation is
designed for a scenario involving a Web application for which
users post their names and phone numbers. Phone numbers can be
given in very different formats:
+7-12345-67 +7-12-345-67 +7 12 345 67 +7 (12) 345 67 +71234567
The problem raised by dealing with these kinds of values is that the varying permissible formats make searching for a specific phone number very difficult. The solution is to define a new collation that reorders punctuation characters, making them ignorable.
Choose a collation ID, as shown in Section 9.4.2, “Choosing a Collation ID”. The following steps use an ID of 1029.
To modify the
Index.xml
configuration file. This file will be located in the directory named by thecharacter_sets_dir
system variable. You can check the variable value as follows, although the path name might be different on your system:mysql>
SHOW VARIABLES LIKE 'character_sets_dir';
+--------------------+-----------------------------------------+ | Variable_name | Value | +--------------------+-----------------------------------------+ | character_sets_dir | /user/local/mysql/share/mysql/charsets/ | +--------------------+-----------------------------------------+Choose a name for the collation and list it in the
Index.xml
file. In addition, you'll need to provide the collation ordering rules. Find the<charset>
element for the character set to which the collation is being added, and add a<collation>
element that indicates the collation name and ID, to associate the name with the ID. Within the<collation>
element, provide a<rules>
element containing the ordering rules:<charset name="utf8"> ... <collation name="utf8_phone_ci" id="1029"> <rules> <reset>\u0000</reset> <i>\u0020</i> <!-- space --> <i>\u0028</i> <!-- left parenthesis --> <i>\u0029</i> <!-- right parenthesis --> <i>\u002B</i> <!-- plus --> <i>\u002D</i> <!-- hyphen --> </rules> </collation> ... </charset>
If you want a similar collation for other Unicode character sets, add other
<collation>
elements. For example, to defineucs2_phone_ci
, add a<collation>
element to the<charset name="ucs2">
element. Remember that each collation must have its own unique ID.Restart the server and use this statement to verify that the collation is present:
mysql>
SHOW COLLATION LIKE 'utf8_phone_ci';
+---------------+---------+------+---------+----------+---------+ | Collation | Charset | Id | Default | Compiled | Sortlen | +---------------+---------+------+---------+----------+---------+ | utf8_phone_ci | utf8 | 1029 | | | 8 | +---------------+---------+------+---------+----------+---------+
Now test the collation to make sure that it has the desired properties.
Create a table containing some sample phone numbers using the new collation:
mysql>CREATE TABLE phonebook (
->name VARCHAR(64),
->phone VARCHAR(64) CHARACTER SET utf8 COLLATE utf8_phone_ci
->);
Query OK, 0 rows affected (0.09 sec) mysql>INSERT INTO phonebook VALUES ('Svoj','+7 912 800 80 02');
Query OK, 1 row affected (0.00 sec) mysql>INSERT INTO phonebook VALUES ('Hf','+7 (912) 800 80 04');
Query OK, 1 row affected (0.00 sec) mysql>INSERT INTO phonebook VALUES ('Bar','+7-912-800-80-01');
Query OK, 1 row affected (0.00 sec) mysql>INSERT INTO phonebook VALUES ('Ramil','(7912) 800 80 03');
Query OK, 1 row affected (0.00 sec) mysql>INSERT INTO phonebook VALUES ('Sanja','+380 (912) 8008005');
Query OK, 1 row affected (0.00 sec)
Run some queries to see whether the ignored punctuation characters are in fact ignored for sorting and comparisons:
mysql>SELECT * FROM phonebook ORDER BY phone;
+-------+--------------------+ | name | phone | +-------+--------------------+ | Sanja | +380 (912) 8008005 | | Bar | +7-912-800-80-01 | | Svoj | +7 912 800 80 02 | | Ramil | (7912) 800 80 03 | | Hf | +7 (912) 800 80 04 | +-------+--------------------+ 5 rows in set (0.00 sec) mysql>SELECT * FROM phonebook WHERE phone='+7(912)800-80-01';
+------+------------------+ | name | phone | +------+------------------+ | Bar | +7-912-800-80-01 | +------+------------------+ 1 row in set (0.00 sec) mysql>SELECT * FROM phonebook WHERE phone='79128008001';
+------+------------------+ | name | phone | +------+------------------+ | Bar | +7-912-800-80-01 | +------+------------------+ 1 row in set (0.00 sec) mysql>SELECT * FROM phonebook WHERE phone='7 9 1 2 8 0 0 8 0 0 1';
+------+------------------+ | name | phone | +------+------------------+ | Bar | +7-912-800-80-01 | +------+------------------+ 1 row in set (0.00 sec)
This section describes the LDML syntax that MySQL recognizes. This is a subset of the syntax described in the LDML specification available at http://www.unicode.org/reports/tr35/, which should be consulted for further information. The rules described here are all supported except that character sorting occurs only at the primary level. Rules that specify differences at secondary or higher sort levels are recognized (and thus can be included in collation definitions) but are treated as equality at the primary level.
Character Representation
Characters named in LDML rules can be written in
\u
format,
where nnnn
nnnn
is the hexadecimal
Unicode code point value. Within hexadecimal values, the
digits A
through F
are
not case sensitive; \u00E1
and
\u00e1
are equivalent. Basic Latin letters
A-Z
and a-z
can also be
written literally (this is a MySQL limitation; the LDML
specification permits literal non-Latin1 characters in the
rules). Only characters in the Basic Multilingual Plane can be
specified. This notation does not apply to characters outside
the BMP range of 0000
to
FFFF
.
The Index.xml
file itself should be
written using ASCII encoding.
Syntax Rules
LDML has reset rules and shift rules to specify character ordering. Orderings are given as a set of rules that begin with a reset rule that establishes an anchor point, followed by shift rules that indicate how characters sort relative to the anchor point.
A
<reset>
rule does not specify any ordering in and of itself. Instead, it “resets” the ordering for subsequent shift rules to cause them to be taken in relation to a given character. Either of the following rules resets subsequent shift rules to be taken in relation to the letter'A'
:<reset>A</reset> <reset>\u0041</reset>
The
<p>
,<s>
, and<t>
shift rules define primary, secondary, and tertiary differences of a character from another character:Use primary differences to distinguish separate letters.
Use secondary differences to distinguish accent variations.
Use tertiary differences to distinguish lettercase variations.
Either of these rules specifies a primary shift rule for the
'G'
character:<p>G</p> <p>\u0047</p>
The
<i>
shift rule indicates that one character sorts identically to another. The following rules cause'b'
to sort the same as'a'
:<reset>a</reset> <i>b</i>
The
<i>
shift rules is supported as of MySQL 5.5.3. Prior to 5.5.3, use<s> ... </s>
instead.
You can change the default server character set and collation with
the --character-set-server
and
--collation-server
options when you
start the server. The collation must be a legal collation for the
default character set. (Use the SHOW
COLLATION
statement to determine which collations are
available for each character set.) See
Section 5.1.2, “Server Command Options”.
If you try to use a character set that is not compiled into your binary, you might run into the following problems:
Your program uses an incorrect path to determine where the character sets are stored (which is typically the
share/mysql/charsets
orshare/charsets
directory under the MySQL installation directory). This can be fixed by using the--character-sets-dir
option when you run the program in question. For example, to specify a directory to be used by MySQL client programs, list it in the[client]
group of your option file. The examples given here show what the setting might look like for Unix or Windows, respectively:[client] character-sets-dir=/usr/local/mysql/share/mysql/charsets [client] character-sets-dir="C:/Program Files/MySQL/MySQL Server 5.5/share/charsets"
The character set is a complex character set that cannot be loaded dynamically. In this case, you must recompile the program with support for the character set.
For Unicode character sets, you can define collations without recompiling by using LDML notation. See Section 9.4.4, “Adding a UCA Collation to a Unicode Character Set”.
The character set is a dynamic character set, but you do not have a configuration file for it. In this case, you should install the configuration file for the character set from a new MySQL distribution.
If your character set index file does not contain the name for the character set, your program displays an error message. The file is named
Index.xml
and the message is:Character set '
charset_name
' is not a compiled character set and is not specified in the '/usr/share/mysql/charsets/Index.xml' fileTo solve this problem, you should either get a new index file or manually add the name of any missing character sets to the current file.
You can force client programs to use specific character set as follows:
[client]
default-character-set=charset_name
This is normally unnecessary. However, when
character_set_system
differs from
character_set_server
or
character_set_client
, and you
input characters manually (as database object identifiers, column
values, or both), these may be displayed incorrectly in output
from the client or the output itself may be formatted incorrectly.
In such cases, starting the mysql client with
--default-character-set=
—that
is, setting the client character set to match the system character
set—should fix the problem.
system_character_set
For MyISAM
tables, you can check the character
set name and number for a table with myisamchk -dvv
tbl_name
.
The MySQL server maintains several time zone settings:
The system time zone. When the server starts, it attempts to determine the time zone of the host machine and uses it to set the
system_time_zone
system variable. The value does not change thereafter.You can set the system time zone for MySQL Server at startup with the
--timezone=
option to mysqld_safe. You can also set it by setting thetimezone_name
TZ
environment variable before you start mysqld. The permissible values for--timezone
orTZ
are system dependent. Consult your operating system documentation to see what values are acceptable.The server's current time zone. The global
time_zone
system variable indicates the time zone the server currently is operating in. The initial value fortime_zone
is'SYSTEM'
, which indicates that the server time zone is the same as the system time zone.The initial global server time zone value can be specified explicitly at startup with the
--default-time-zone=
option on the command line, or you can use the following line in an option file:timezone
default-time-zone='
timezone
'If you have the
SUPER
privilege, you can set the global server time zone value at runtime with this statement:mysql>
SET GLOBAL time_zone =
timezone
;Per-connection time zones. Each client that connects has its own time zone setting, given by the session
time_zone
variable. Initially, the session variable takes its value from the globaltime_zone
variable, but the client can change its own time zone with this statement:mysql>
SET time_zone =
timezone
;
The current session time zone setting affects display and storage
of time values that are zone-sensitive. This includes the values
displayed by functions such as
NOW()
or
CURTIME()
, and values stored in and
retrieved from TIMESTAMP
columns.
Values for TIMESTAMP
columns are
converted from the current time zone to UTC for storage, and from
UTC to the current time zone for retrieval.
The current time zone setting does not affect values displayed by
functions such as UTC_TIMESTAMP()
or values in DATE
,
TIME
, or
DATETIME
columns. Nor are values in
those data types stored in UTC; the time zone applies for them
only when converting from TIMESTAMP
values. If
you want locale-specific arithmetic for
DATE
,
TIME
, or
DATETIME
values, convert them to
UTC, perform the arithmetic, and then convert back.
The current values of the global and client-specific time zones can be retrieved like this:
mysql> SELECT @@global.time_zone, @@session.time_zone;
timezone
values can be given in several
formats, none of which are case sensitive:
The value
'SYSTEM'
indicates that the time zone should be the same as the system time zone.The value can be given as a string indicating an offset from UTC, such as
'+10:00'
or'-6:00'
.The value can be given as a named time zone, such as
'Europe/Helsinki'
,'US/Eastern'
, or'MET'
. Named time zones can be used only if the time zone information tables in themysql
database have been created and populated.
The MySQL installation procedure creates the time zone tables in
the mysql
database, but does not load them. You
must do so manually using the following instructions. (If you are
upgrading to MySQL 4.1.3 or later from an earlier version, you can
create the tables by upgrading your mysql
database. Use the instructions in Section 4.4.7, “mysql_upgrade — Check Tables for MySQL Upgrade”.
After creating the tables, you can load them.)
Loading the time zone information is not necessarily a one-time operation because the information changes occasionally. For example, the rules for Daylight Saving Time in the United States, Mexico, and parts of Canada changed in 2007. When such changes occur, applications that use the old rules become out of date and you may find it necessary to reload the time zone tables to keep the information used by your MySQL server current. See the notes at the end of this section.
If your system has its own zoneinfo
database (the set of files describing time zones), you should use
the mysql_tzinfo_to_sql program for filling the
time zone tables. Examples of such systems are Linux, FreeBSD,
Solaris, and Mac OS X. One likely location for these files is the
/usr/share/zoneinfo
directory. If your system
does not have a zoneinfo database, you can use the downloadable
package described later in this section.
The mysql_tzinfo_to_sql program is used to load the time zone tables. On the command line, pass the zoneinfo directory path name to mysql_tzinfo_to_sql and send the output into the mysql program. For example:
shell> mysql_tzinfo_to_sql /usr/share/zoneinfo | mysql -u root mysql
mysql_tzinfo_to_sql reads your system's time zone files and generates SQL statements from them. mysql processes those statements to load the time zone tables.
mysql_tzinfo_to_sql also can be used to load a single time zone file or to generate leap second information:
To load a single time zone file
tz_file
that corresponds to a time zone nametz_name
, invoke mysql_tzinfo_to_sql like this:shell>
mysql_tzinfo_to_sql
tz_file
tz_name
| mysql -u root mysqlWith this approach, you must execute a separate command to load the time zone file for each named zone that the server needs to know about.
If your time zone needs to account for leap seconds, initialize the leap second information like this, where
tz_file
is the name of your time zone file:shell>
mysql_tzinfo_to_sql --leap
tz_file
| mysql -u root mysqlAfter running mysql_tzinfo_to_sql, it is best to restart the server so that it does not continue to use any previously cached time zone data.
If your system is one that has no zoneinfo database (for example, Windows or HP-UX), you can use the package of pre-built time zone tables that is available for download at the MySQL Developer Zone:
http://dev.mysql.com/downloads/timezones.html
This time zone package contains .frm
,
.MYD
, and .MYI
files for
the MyISAM
time zone tables. These tables
should be part of the mysql
database, so you
should place the files in the mysql
subdirectory of your MySQL server's data directory. The server
should be stopped while you do this and restarted afterward.
Do not use the downloadable package if your system has a zoneinfo database. Use the mysql_tzinfo_to_sql utility instead. Otherwise, you may cause a difference in datetime handling between MySQL and other applications on your system.
For information about time zone settings in replication setup, please see Section 15.4.1, “Replication Features and Issues”.
As mentioned earlier, when the time zone rules change, applications that use the old rules become out of date. To stay current, it is necessary to make sure that your system uses current time zone information is used. For MySQL, there are two factors to consider in staying current:
The operating system time affects the value that the MySQL server uses for times if its time zone is set to
SYSTEM
. Make sure that your operating system is using the latest time zone information. For most operating systems, the latest update or service pack prepares your system for the time changes. Check the Web site for your operating system vendor for an update that addresses the time changes.If you replace the system's
/etc/localtime
timezone file with a version that uses rules differing from those in effect at mysqld startup, you should restart mysqld so that it uses the updated rules. Otherwise, mysqld might not notice when the system changes its time.If you use named time zones with MySQL, make sure that the time zone tables in the
mysql
database are up to date. If your system has its own zoneinfo database, you should reload the MySQL time zone tables whenever the zoneinfo database is updated, using the instructions given earlier in this section. For systems that do not have their own zoneinfo database, check the MySQL Developer Zone for updates. When a new update is available, download it and use it to replace your current time zone tables. mysqld caches time zone information that it looks up, so after replacing the time zone tables, you should restart mysqld to make sure that it does not continue to serve outdated time zone data.
If you are uncertain whether named time zones are available, for use either as the server's time zone setting or by clients that set their own time zone, check whether your time zone tables are empty. The following query determines whether the table that contains time zone names has any rows:
mysql> SELECT COUNT(*) FROM mysql.time_zone_name;
+----------+
| COUNT(*) |
+----------+
| 0 |
+----------+
A count of zero indicates that the table is empty. In this case, no one can be using named time zones, and you don't need to update the tables. A count greater than zero indicates that the table is not empty and that its contents are available to be used for named time zone support. In this case, you should be sure to reload your time zone tables so that anyone who uses named time zones will get correct query results.
To check whether your MySQL installation is updated properly for a change in Daylight Saving Time rules, use a test like the one following. The example uses values that are appropriate for the 2007 DST 1-hour change that occurs in the United States on March 11 at 2 a.m.
The test uses these two queries:
SELECT CONVERT_TZ('2007-03-11 2:00:00','US/Eastern','US/Central'); SELECT CONVERT_TZ('2007-03-11 3:00:00','US/Eastern','US/Central');
The two time values indicate the times at which the DST change occurs, and the use of named time zones requires that the time zone tables be used. The desired result is that both queries return the same result (the input time, converted to the equivalent value in the 'US/Central' time zone).
Before updating the time zone tables, you would see an incorrect result like this:
mysql>SELECT CONVERT_TZ('2007-03-11 2:00:00','US/Eastern','US/Central');
+------------------------------------------------------------+ | CONVERT_TZ('2007-03-11 2:00:00','US/Eastern','US/Central') | +------------------------------------------------------------+ | 2007-03-11 01:00:00 | +------------------------------------------------------------+ mysql>SELECT CONVERT_TZ('2007-03-11 3:00:00','US/Eastern','US/Central');
+------------------------------------------------------------+ | CONVERT_TZ('2007-03-11 3:00:00','US/Eastern','US/Central') | +------------------------------------------------------------+ | 2007-03-11 02:00:00 | +------------------------------------------------------------+
After updating the tables, you should see the correct result:
mysql>SELECT CONVERT_TZ('2007-03-11 2:00:00','US/Eastern','US/Central');
+------------------------------------------------------------+ | CONVERT_TZ('2007-03-11 2:00:00','US/Eastern','US/Central') | +------------------------------------------------------------+ | 2007-03-11 01:00:00 | +------------------------------------------------------------+ mysql>SELECT CONVERT_TZ('2007-03-11 3:00:00','US/Eastern','US/Central');
+------------------------------------------------------------+ | CONVERT_TZ('2007-03-11 3:00:00','US/Eastern','US/Central') | +------------------------------------------------------------+ | 2007-03-11 01:00:00 | +------------------------------------------------------------+
In MySQL 5.5, leap second values are returned with
a time part that ends with :59:59
. This means
that a function such as NOW()
can
return the same value for two or three consecutive seconds
during the leap second. It remains true that literal temporal
values having a time part that ends with
:59:60
or :59:61
are
considered invalid.
If it is necessary to search for
TIMESTAMP
values one second
before the leap second, anomalous results may be obtained if you
use a comparison with 'YYYY-MM-DD hh:mm:ss'
values:
mysql>CREATE TABLE t1 (a INT, ts TIMESTAMP DEFAULT NOW(), PRIMARY KEY (ts));
Query OK, 0 rows affected (0.11 sec) mysql># Simulate NOW() = '2009-01-01 02:59:59'
mysql>SET timestamp = 1230768022;
Query OK, 0 rows affected (0.00 sec) mysql>INSERT INTO t1 (a) VALUES (1);
Query OK, 1 row affected (0.07 sec) mysql># Simulate NOW() = '2009-01-01 02:59:60'
mysql>SET timestamp = 1230768023;
Query OK, 0 rows affected (0.00 sec) mysql>INSERT INTO t1 (a) VALUES (2);
Query OK, 1 row affected (0.02 sec) mysql>SELECT * FROM t1;
+------+---------------------+ | a | ts | +------+---------------------+ | 1 | 2008-12-31 18:00:22 | | 2 | 2008-12-31 18:00:23 | +------+---------------------+ 2 rows in set (0.02 sec) mysql>SELECT * FROM t1 WHERE ts = '2009-01-01 02:59:59';
Empty set (0.03 sec)
To work around this, you can use a comparison based on the UTC value actually stored in column, which has the leap second correction applied:
mysql> SELECT * FROM t1 WHERE UNIX_TIMESTAMP(ts) = 1230768023;
+------+---------------------+
| a | ts |
+------+---------------------+
| 2 | 2008-12-31 18:00:23 |
+------+---------------------+
1 row in set (0.02 sec)
The locale indicated by the
lc_time_names
system variable
controls the language used to display day and month names and
abbreviations. This variable affects the output from the
DATE_FORMAT()
,
DAYNAME()
, and
MONTHNAME()
functions.
The lc_time_names
value does not
affect the result from FORMAT()
,
but this function takes an optional third parameter that enables a
locale to be specified to be used for the result number's decimal
point, thousands separator, and grouping between separators.
Permissible locale values are the same as the legal values for the
lc_time_names
system variable.
Locale names have language and region subtags listed by IANA
(http://www.iana.org/assignments/language-subtag-registry)
such as 'ja_JP'
or 'pt_BR'
.
The default value is 'en_US'
regardless of your
system's locale setting, but you can set the value at server
startup or set the GLOBAL
value if you have the
SUPER
privilege. Any client can
examine the value of
lc_time_names
or set its
SESSION
value to affect the locale for its own
connection.
mysql>SET NAMES 'utf8';
Query OK, 0 rows affected (0.09 sec) mysql>SELECT @@lc_time_names;
+-----------------+ | @@lc_time_names | +-----------------+ | en_US | +-----------------+ 1 row in set (0.00 sec) mysql>SELECT DAYNAME('2010-01-01'), MONTHNAME('2010-01-01');
+-----------------------+-------------------------+ | DAYNAME('2010-01-01') | MONTHNAME('2010-01-01') | +-----------------------+-------------------------+ | Friday | January | +-----------------------+-------------------------+ 1 row in set (0.00 sec) mysql>SELECT DATE_FORMAT('2010-01-01','%W %a %M %b');
+-----------------------------------------+ | DATE_FORMAT('2010-01-01','%W %a %M %b') | +-----------------------------------------+ | Friday Fri January Jan | +-----------------------------------------+ 1 row in set (0.00 sec) mysql>SET lc_time_names = 'es_MX';
Query OK, 0 rows affected (0.00 sec) mysql>SELECT @@lc_time_names;
+-----------------+ | @@lc_time_names | +-----------------+ | es_MX | +-----------------+ 1 row in set (0.00 sec) mysql>SELECT DAYNAME('2010-01-01'), MONTHNAME('2010-01-01');
+-----------------------+-------------------------+ | DAYNAME('2010-01-01') | MONTHNAME('2010-01-01') | +-----------------------+-------------------------+ | viernes | enero | +-----------------------+-------------------------+ 1 row in set (0.00 sec) mysql>SELECT DATE_FORMAT('2010-01-01','%W %a %M %b');
+-----------------------------------------+ | DATE_FORMAT('2010-01-01','%W %a %M %b') | +-----------------------------------------+ | viernes vie enero ene | +-----------------------------------------+ 1 row in set (0.00 sec)
The day or month name for each of the affected functions is
converted from utf8
to the character set
indicated by the
character_set_connection
system
variable.
lc_time_names
may be set to any
of the following locale values. The set of locales supported by
MySQL may differ from those supported by your operating system.
ar_AE : Arabic - United Arab Emirates | ar_BH : Arabic - Bahrain |
ar_DZ : Arabic - Algeria | ar_EG : Arabic - Egypt |
ar_IN : Arabic - India | ar_IQ : Arabic - Iraq |
ar_JO : Arabic - Jordan | ar_KW : Arabic - Kuwait |
ar_LB : Arabic - Lebanon | ar_LY : Arabic - Libya |
ar_MA : Arabic - Morocco | ar_OM : Arabic - Oman |
ar_QA : Arabic - Qatar | ar_SA : Arabic - Saudi Arabia |
ar_SD : Arabic - Sudan | ar_SY : Arabic - Syria |
ar_TN : Arabic - Tunisia | ar_YE : Arabic - Yemen |
be_BY : Belarusian - Belarus | bg_BG : Bulgarian - Bulgaria |
ca_ES : Catalan - Spain | cs_CZ : Czech - Czech Republic |
da_DK : Danish - Denmark | de_AT : German - Austria |
de_BE : German - Belgium | de_CH : German - Switzerland |
de_DE : German - Germany | de_LU : German - Luxembourg |
el_GR : Greek - Greece | en_AU : English - Australia |
en_CA : English - Canada | en_GB : English - United Kingdom |
en_IN : English - India | en_NZ : English - New Zealand |
en_PH : English - Philippines | en_US : English - United States |
en_ZA : English - South Africa | en_ZW : English - Zimbabwe |
es_AR : Spanish - Argentina | es_BO : Spanish - Bolivia |
es_CL : Spanish - Chile | es_CO : Spanish - Columbia |
es_CR : Spanish - Costa Rica | es_DO : Spanish - Dominican Republic |
es_EC : Spanish - Ecuador | es_ES : Spanish - Spain |
es_GT : Spanish - Guatemala | es_HN : Spanish - Honduras |
es_MX : Spanish - Mexico | es_NI : Spanish - Nicaragua |
es_PA : Spanish - Panama | es_PE : Spanish - Peru |
es_PR : Spanish - Puerto Rico | es_PY : Spanish - Paraguay |
es_SV : Spanish - El Salvador | es_US : Spanish - United States |
es_UY : Spanish - Uruguay | es_VE : Spanish - Venezuela |
et_EE : Estonian - Estonia | eu_ES : Basque - Basque |
fi_FI : Finnish - Finland | fo_FO : Faroese - Faroe Islands |
fr_BE : French - Belgium | fr_CA : French - Canada |
fr_CH : French - Switzerland | fr_FR : French - France |
fr_LU : French - Luxembourg | gl_ES : Galician - Spain |
gu_IN : Gujarati - India | he_IL : Hebrew - Israel |
hi_IN : Hindi - India | hr_HR : Croatian - Croatia |
hu_HU : Hungarian - Hungary | id_ID : Indonesian - Indonesia |
is_IS : Icelandic - Iceland | it_CH : Italian - Switzerland |
it_IT : Italian - Italy | ja_JP : Japanese - Japan |
ko_KR : Korean - Republic of Korea | lt_LT : Lithuanian - Lithuania |
lv_LV : Latvian - Latvia | mk_MK : Macedonian - FYROM |
mn_MN : Mongolia - Mongolian | ms_MY : Malay - Malaysia |
nb_NO : Norwegian(Bokmål) - Norway | nl_BE : Dutch - Belgium |
nl_NL : Dutch - The Netherlands | no_NO : Norwegian - Norway |
pl_PL : Polish - Poland | pt_BR : Portugese - Brazil |
pt_PT : Portugese - Portugal | ro_RO : Romanian - Romania |
ru_RU : Russian - Russia | ru_UA : Russian - Ukraine |
sk_SK : Slovak - Slovakia | sl_SI : Slovenian - Slovenia |
sq_AL : Albanian - Albania | sr_RS : Serbian - Yugoslavia |
sv_FI : Swedish - Finland | sv_SE : Swedish - Sweden |
ta_IN : Tamil - India | te_IN : Telugu - India |
th_TH : Thai - Thailand | tr_TR : Turkish - Turkey |
uk_UA : Ukrainian - Ukraine | ur_PK : Urdu - Pakistan |
vi_VN : Vietnamese - Viet Nam | zh_CN : Chinese - China |
zh_HK : Chinese - Hong Kong | zh_TW : Chinese - Taiwan Province of China |
lc_time_names
currently does not
affect the STR_TO_DATE()
or
GET_FORMAT()
function.