Of character sets, encodings, and other wondrous things
Thursday 13th July 2017 7:39 PM

One of the most important things to understand is the difference between a character set and an encoding. Confusing the two is probably the source of most misunderstandings about how all this works.

As an example, these are character sets:

  • ASCII
  • Unicode
  • Windows-1252 (also known as CP-1252)
  • ISO-8859-1 (also known as Latin-1)
  • GB2312

while these are encodings:

  • UTF-8
  • UTF-16
  • UTF-32
  • EUC-JP

What's the difference?!

Character sets

If you want to deal with a piece of text (i.e. a "string"), the first problem you face is that computers can only handle numbers, not letters. The obvious solution is to come up with a system that converts letters to numbers (known as "code points") e.g. A=1, B=2, C=3, etc.

One popular system (or "character set") is ASCII, and if we look each letter up in an ASCII chart, we can see that the string "Hello" would be represented by the numbers: 72 101 108 108 111.
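We can check this in Python, for example (ord() returns a character's code point, which for these letters is the same as its ASCII value):

    >>> [ord(ch) for ch in "Hello"]
    [72, 101, 108, 108, 111]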

Another character set is BCD[1], and the same string "Hello" would be represented by the numbers: 24 21 35 35 38.

Note that when you want to convert a series of numbers back into letters, you must know what character set is being used. For example, if we tried to decode the ASCII numbers above as BCD, or vice versa, we wouldn't get the correct string back, since each number corresponds to a different letter in the two character sets.
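A quick Python 3 sketch of what goes wrong (Python has no built-in BCD codec, so we just reuse the BCD numbers quoted above and read them as ASCII):

    >>> bytes([24, 21, 35, 35, 38]).decode("ascii")   # the BCD numbers for "Hello"
    '\x18\x15##&'

We get back two control characters and some punctuation - not "Hello".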

ASCII only allows for 128 different numbers (and hence a maximum of 128 different letters), and BCD even fewer (64), so while they might be OK for English, they're woefully inadequate for languages such as Chinese or Japanese, which have many thousands of characters between them. So, a new character set called Unicode was created, which allows for over a million characters[2].

Encodings

If you want to write code that can handle international text, Unicode is the only character set you need to worry about. For example, if you wanted to write "日本"[3], these 2 characters would be represented by the numbers 26085 and 26412[4], or in hexadecimal: 0x65E5 and 0x672C.
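In Python 3, for example, we can look these code points up directly:

    >>> [hex(ord(ch)) for ch in "日本"]
    ['0x65e5', '0x672c']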

However, there's a snag: if we want to store these numbers in memory, it's not so simple, since memory is a series of bytes, which can only handle numbers from 0 to 255, and these numbers are both much bigger than that. One solution is to store each number spread over two bytes, like this:
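    address:   n     n+1   n+2   n+3
    byte:      0x65  0xE5  0x67  0x2C      (high byte of each value stored first)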

However, some CPUs were designed to store 2-byte values the other way around[5], like this:
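    address:   n     n+1   n+2   n+3
    byte:      0xE5  0x65  0x2C  0x67      (low byte of each value stored first)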

This is known as a little-endian system, since the low-value (or little) byte is stored first, while the first system above is known as big-endian.

The important thing to understand here is that there are 2 different ways we can store the numbers in memory (known as "encodings"), and as with character sets, it's crucial to know which one is being used when you're trying to convert the numbers back to text. If you take a string that was encoded big-endian (i.e. the first diagram above) and try to read it back as little-endian, the numbers will come out wrong (i.e. 0xE565 and 0x2C67), and hence will be converted to the wrong letters.
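We can see this in Python 3, for example, using the explicit big- and little-endian variants of the UTF-16 codec:

    >>> data = "日本".encode("utf-16-be")   # stored big-endian: 65 E5 67 2C
    >>> data.decode("utf-16-le")            # read back little-endian
    '\ue565Ⱨ'

The string comes back as code points 0xE565 and 0x2C67 - two completely wrong characters.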

Common encodings

Back in the day, when Unicode was smaller and had fewer than 65,536 characters, the 2 encodings above were, in fact, used[6], but they are now obsolete[7]. Furthermore, encodings like these are very wasteful when storing plain ASCII text[8], so new encodings have been devised to address these problems. By far the most common is UTF-8, but there are others, e.g. UTF-16 or EUC-JP.
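For example, in Python 3 (using the "-be" variant of UTF-32 so that no byte-order mark is added to the output):

    >>> len("Hello".encode("utf-8"))       # 1 byte per ASCII letter
    5
    >>> len("Hello".encode("utf-32-be"))   # 4 bytes per letter
    20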

How each encoding works is not important for the purpose of this discussion, but it's crucial to remember that they all do the same thing: convert code points (i.e. the numbers that represent each letter of your string) into bytes that can be stored in memory[9]. However, the way each encoding does this is different, so if you store a string in memory using one encoding, and then try to read it back using another, the code points will be wrong, and so the string read back will be wrong. You can think of it like compressing files: if you compress a file to a ZIP, then try to decompress it as a TAR, it won't work.
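A short Python 3 illustration of both points (EUC-JP and Windows-1252 are both available as standard codecs):

    >>> "日本".encode("utf-8").hex()       # same code points...
    'e697a5e69cac'
    >>> "日本".encode("euc_jp").hex()      # ...but different bytes
    'c6fccbdc'
    >>> "日本".encode("utf-8").decode("cp1252")   # decoded with the wrong encoding
    'æ—¥æœ¬'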

Putting it all together

In practice, there is only one rule: every time you do something with a string, you must know what character set is being used, and how the string was encoded. And just to make things more difficult 🙂, it's not always obvious when "every time you do something with a string" applies.
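For example, simply reading a text file in Python 3 quietly involves a decode (a minimal sketch; the filename is made up):

    # the bytes on disk are decoded using the encoding you specify here -
    # if you omit it, a platform-dependent default is used, which is a
    # classic source of hard-to-find bugs
    with open("feed.xml", encoding="utf-8") as f:
        text = f.read()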

As an example, we'll write some Python code that gets information out of Awasu, some of which is not English, and generate an HTML page listing it, first using Python 2, and then again using Python 3, taking a look at the coding issues that come up, and how the two Pythons differ.

But Python 3 uses Unicode strings...?

As an aside, I suspect a lot of the confusion people have about strings in Python comes from hearing the phrase "Python 3 uses Unicode strings" over and over, which is a little misleading.

In Python 2, string variables are stored in memory as a series of bytes, nothing more. They can be used to store Unicode text (e.g. Chinese, Japanese, or other non-ASCII stuff), but you have to manage the encoding yourself. If you adopt a convention of always using, say, UTF-8, then you can quite happily manage Unicode text, even in Python 2. However, Python 2 string variables can also be used to store ASCII strings[10], so you need to be very aware of what character set each string variable is using, since it's very easy to write code that looks like it's working (because you only tested it with ASCII text), only to have it fail when you push Unicode text through it 🙁

In Python 3, string variables always use the Unicode character set, and if you try to store ASCII text in one, it will be converted to Unicode first[11]. These strings are then stored in memory using an internal encoding chosen by the interpreter (historically UTF-16 or UTF-32, depending on the platform and build of Python). If you really want to store the string as a series of bytes[12], you need to use a bytes variable.
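A minimal Python 3 sketch of the str/bytes split:

    >>> s = "日本"                # str: a sequence of Unicode code points
    >>> b = s.encode("utf-8")     # bytes: those code points, encoded
    >>> len(s), len(b)            # 2 characters, but 6 bytes
    (2, 6)
    >>> b.decode("utf-8") == s    # decoding with the matching encoding round-trips
    True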

So, the difference is:

  • in Python 2, strings are stored as a series of bytes (in variables of type str), and if you want to handle Unicode text, you have to manage the encoding yourself. Or, just use variables of type unicode.
  • in Python 3, strings (variables of type str) always use the Unicode character set, with the encoding managed internally by the Python interpreter[13], and if you want a series-of-bytes string, you have to use a bytes variable, and manage the encoding yourself.
 

Tutorial index

Calling the Awasu API using Python 2 »


1. This was very popular in the early days of computing, although it's not used much these days.
2. Although only about 12% of these are currently in use.
3. "Japan", in Japanese
4. Since Unicode has over a million characters, it can't be represented in a screenshot as for ASCII and BCD, but you can look up characters at places like this or this.
5. You might ask why people would want to store numbers the "wrong" way around, but in the early days of computing, memory diagrams were drawn vertically, going upwards, and storing the low byte first makes sense there.
6. These encodings are known as UCS-2BE and UCS-2LE.
7. Since Unicode has grown, and would now require at least 3 bytes to store all the possible values.
8. Since they would use 3 bytes to store every letter, when each one really only needs 1 byte.
9. Or in a file, or in a network packet, anything that deals with a series of bytes.
10. Or text in any character set, since they are, after all, just stored as a series of bytes.
11. This will always work, since ASCII is a subset of Unicode.
12. For example, because you really want an ASCII string, or a Unicode string encoded using UTF-8.
13. Which encoding it uses is not really important, unless you ever need access to the underlying bytes.