Intro to Python Unicode

The ASCII table has only 128 code points, which is not enough to cover all languages and accented characters. That is why Unicode was invented. Its code space has room for 1,114,112 code points (U+0000 through U+10FFFF). The first 128 code points are identical to ASCII, which makes Unicode backward compatible with it.
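To make the idea of a code point concrete, here is a small sketch using the built-in ord() and chr() functions and sys.maxunicode. The variable names are only for illustration.

```python
import sys

# ord() returns the code point of a single character,
# chr() returns the character for a given code point.
code_a = ord("A")          # inside the ASCII range (0-127)
code_e_acute = ord("é")    # outside the ASCII range
char_back = chr(code_e_acute)

# sys.maxunicode is the largest code point Python supports (U+10FFFF),
# so the number of possible code points is one more than that.
total_code_points = sys.maxunicode + 1

print(code_a)              # 65
print(code_e_acute)        # 233
print(char_back)           # é
print(total_code_points)   # 1114112
```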

The important thing is that Unicode is not an encoding; it only specifies the mapping between characters and code points. Several encodings turn those code points into bytes, including UTF-8, UTF-16, and UTF-32. UTF-8 is the most popular, and in Python 3 it is the default for source files and for str.encode() and bytes.decode().
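As a rough illustration of "one set of code points, many encodings", the sketch below encodes the same string three ways. It uses the "-le" (little-endian) variants of UTF-16 and UTF-32 so that no byte order mark is added and the lengths are easy to compare; that choice is mine, not something the text above requires.

```python
text = "café"  # four code points

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")   # 2 bytes per character here
utf32 = text.encode("utf-32-le")   # always 4 bytes per character

print(utf8)                               # b'caf\xc3\xa9'
print(len(utf8), len(utf16), len(utf32))  # 5 8 16

# Every encoding decodes back to the same string.
same = utf16.decode("utf-16-le") == utf32.decode("utf-32-le") == text
print(same)                               # True
```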

There are two representations of text data in Python: bytes and strings (str). Encoding turns a string into bytes, and decoding turns bytes back into a string.

If I write

"hello world".encode("utf-8")

The interactive interpreter shows

b'hello world'

The bytes look the same as the text because every character in the string is ASCII. But if I write

"café".encode("utf-8")

I get this

b'caf\xc3\xa9'

The first three letters are the same since they are in the ASCII table. The last one, however, is not, so UTF-8 represents it with two bytes, \xc3 and \xa9.

UTF-8 is a variable-length encoding. An ASCII character always takes a single byte, while any other character takes two to four bytes.
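A quick way to see the variable length is to encode a few single characters and count the bytes. The sample characters below are my own picks, one for each possible length.

```python
samples = ["a", "é", "€", "😀"]

# Map each character to the number of bytes UTF-8 needs for it.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s)")
# U+0061 takes 1 byte(s)
# U+00E9 takes 2 byte(s)
# U+20AC takes 3 byte(s)
# U+1F600 takes 4 byte(s)
```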

The opposite of encoding is decoding, so if I have a bytes representation and write

b'caf\xc3\xa9'.decode()

I get back the word "café".
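Calling decode() with no argument uses UTF-8. Decoding only gives the right text when the encoding matches the one used to produce the bytes. The sketch below shows what can happen otherwise; the choice of Latin-1 and ASCII as the "wrong" encodings is just for illustration.

```python
data = b"caf\xc3\xa9"

right = data.decode("utf-8")     # the encoding the bytes were written in
wrong = data.decode("latin-1")   # each byte becomes its own character

print(right)   # café
print(wrong)   # cafÃ©

# ASCII cannot represent bytes above 0x7F, so strict decoding fails.
try:
    data.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)   # True

# errors="replace" swaps each undecodable byte for U+FFFD instead of raising.
replaced = data.decode("ascii", errors="replace")
print(len(replaced))  # 5
```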

In Python 3, all strings are Unicode by default and can contain any Unicode character. Non-ASCII letters can even be used in variable names, but it is not recommended.
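For example, the following runs in Python 3. The names are made up for the demonstration, and str.isidentifier() is used to check which strings are allowed as names.

```python
# Non-ASCII letters are valid in identifiers.
café = "coffee"
π = 3.14159

print(café, π)   # coffee 3.14159

# Letters from other scripts are allowed, symbols such as emoji are not.
letters_ok = "café".isidentifier()
emoji_ok = "😀".isidentifier()
print(letters_ok, emoji_ok)   # True False

# len() counts code points, not bytes.
print(len("café"))            # 4
```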

Regular expressions also match Unicode by default when the pattern is a string.
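A small sketch of what that means in practice: with a str pattern, \w matches accented letters too, and the re.ASCII flag switches back to ASCII-only matching. The sample text is my own.

```python
import re

text = "café naïve 123"

# Default (Unicode) matching: \w includes accented letters.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w to [a-zA-Z0-9_], so words split at é and ï.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)  # ['café', 'naïve', '123']
print(ascii_words)    # ['caf', 'na', 've', '123']
```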