UTF-8 all the way through
PostgreSQL implements UTF-8 as a server encoding and as a client encoding, so that you can use unicode all the way through.
The default encoding and collation for a PostgreSQL database server are setup at initdb time. This is usually done by your packaging system at installation.
Database Encoding
It’s important to set your environement correctly before installing
PostgreSQL, in order to avoid surprises. The following example sets the
system to use en_US.UTF-8
before installing PostgreSQL on a docker
container using debian:
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get install -y locales
RUN echo "en_US.UTF-8 UTF-8" > /etc/locale.gen && \
locale-gen en_US.UTF-8 && \
dpkg-reconfigure locales && \
/usr/sbin/update-locale LANG=en_US.UTF-8
ENV LC_ALL en_US.UTF-8
To check your current installation settings, you can have a look at the
template0
and template1
databases settings:
\l template[01]
We can see that our local PostgreSQL cluster is using the intended server side settings for unicode:
List of databases
Name │ Owner │ Encoding │ Collate │ Ctype │ Access privileges
═══════════╪══════════╪══════════╪═════════════╪═════════════╪═══════════════════════
template0 │ postgres │ UTF8 │ en_US.UTF-8 │ en_US.UTF-8 │ =c/postgres ↵
│ │ │ │ │ postgres=CTc/postgres
template1 │ postgres │ UTF8 │ en_US.UTF-8 │ en_US.UTF-8 │ =c/postgres ↵
│ │ │ │ │ postgres=CTc/postgres
(2 rows)
When that’s not the case, you need to destroy your setup and start from scratch again. How exactly to do that depend on your distribution. Or maybe on your Cloud provider.
Server Encoding and Client Encoding
You can also have an UTF8 encoded database and use a legacy application (or programming language) that doesn’t know how to handle Unicode properly. In that case, you can ask PostgreSQL to convert all and any data on the fly between the server-side encoding and your favorite client-side encoding, thanks to the client_encoding setting.
show client_encoding;
Here again, we use UTF8 client side, which allows handling French accentuated characters we saw previously.
client_encoding
═════════════════
UTF8
(1 row)
Be aware that not all combinations of server encoding and client encoding make sense. While it is possible for PostgreSQL to communicate with your application using the latin1 encoding on the client side if the server side dataset includes texts in incompatible encodings, PostgreSQL will issue an error. Such texts might be written using non-Latin scripts such as Cyrillic, Chinese, Japanese, Arabic or other languages.
From the Emacs facility M-x view-hello-file
, here’s a table with spelling
of hello in plenty of different languages and scripts, all encoded in
UTF8:
language │ hello
════════════════════════════╪═════════════════════════════
Czech (čeština) │ Dobrý den
Danish (dansk) │ Hej / Goddag / Halløj
Dutch (Nederlands) │ Hallo / Dag
English /ˈɪŋɡlɪʃ/ │ Hello
Esperanto │ Saluton (Eĥoŝanĝo ĉiuĵaŭde)
Estonian (eesti keel) │ Tere päevast / Tere õhtust
Finnish (suomi) │ Hei / Hyvää päivää
French (français) │ Bonjour / Salut
Georgian (ქართველი) │ გამარჯობა
German (Deutsch) │ Guten Tag / Grüß Gott
Greek (ελληνικά) │ Γειά σας
Greek, ancient (ἑλληνική) │ Οὖλέ τε καὶ μέγα χαῖρε
Hungarian (magyar) │ Szép jó napot!
Italian (italiano) │ Ciao / Buon giorno
Maltese (il-Malti) │ Bonġu / Saħħa
Mathematics │ ∀ p ∈ world • hello p □
Mongolian (монгол хэл) │ Сайн байна уу?
Norwegian (norsk) │ Hei / God dag
Polish (język polski) │ Dzień dobry! / Cześć!
Russian (русский) │ Здра́вствуйте!
Slovak (slovenčina) │ Dobrý deň
Slovenian (slovenščina) │ Pozdravljeni!
Spanish (español) │ ¡Hola!
Swedish (svenska) │ Hej / Goddag / Hallå
Turkish (Türkçe) │ Merhaba
Ukrainian (українська) │ Вітаю
Vietnamese (tiếng Việt) │ Chào bạn
Japanese (日本語) │ こんにちは / コンニチハ
Chinese (中文,普通话,汉语) │ 你好
Cantonese (粵語,廣東話) │ 早晨, 你好
Now, of course, I can’t have that data sent to me in latin1:
yesql# set client_encoding to latin1;
SET
yesql# select * from hello where language ~ 'Georgian';
ERROR: character with byte sequence 0xe1 0x83 0xa5 in encoding "UTF8" has no equivalent in encoding "LATIN1"
yesql# reset client_encoding ;
RESET
So if it’s possible for you, use UTF-8 encoding and you’ll have a much simpler life. It must be noted that Unicode encoding makes comparing and sorting text a rather costly operation. That said being fast and wrong is not an option, so we are going to still use unicode text!