Awasu » This is all too complicated, just tell me what to do!
Thursday 13th July 2017 7:40 PM

The Python 2 code was a lot easier to get working than the Python 3 version, but this is deceptive: there was a fair bit of hand-waving over encoding and is-this-a-string-or-bytes issues, and things only worked because everything happened to be in UTF-8. Python 3 is much stricter about these things, which means it will complain if you don't get them right, but it also means that your code will work properly because it is, well, right. It's easy to write Python 2 code that looks like it's working, only to find out that you were just lucky and it doesn't handle certain cases properly 😥

Python 2

In Python 2, it's probably easier to do everything using UTF-8, but this means you need to test your code carefully, since it's easy to write code that works with English text, but fails with non-English text.

Some people advocate using Unicode strings (i.e. variables of type unicode, not str), which is also OK. However, since Awasu always gives you UTF-8, you would have to decode that into Unicode strings and then, if you want to output UTF-8, encode it back again, which is inefficient.
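To make the trade-off concrete, here is a minimal sketch of that round trip. The byte string is a made-up stand-in for data received from Awasu (it is not a real API response), and the code is written so that it should run under both Python 2 and Python 3.

```python
# Hypothetical stand-in for UTF-8 bytes received from Awasu:
# the word "cafe" with an acute accent on the final "e".
raw = b"caf\xc3\xa9"

# Working directly on the UTF-8 bytes looks fine for English text,
# but len() counts bytes here, not characters.
byte_count = len(raw)

# Decoding gives a Unicode string, where len() counts characters.
text = raw.decode("utf-8")
char_count = len(text)

# Encoding again gets back the bytes we started with.
round_tripped = text.encode("utf-8")

print(byte_count)            # 5
print(char_count)            # 4
print(round_tripped == raw)  # True
```

The mismatch between the two lengths is the kind of bug that never shows up when testing with English-only text.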

However, note that if you ask Awasu to return JSON, Python's json library will convert strings to Unicode for you e.g.

import urllib
import json

# get the Awasu channels
url = "http://localhost:2604/channels/list?format=json"
buf = urllib.urlopen( url ).read()
print "API response type:" , type(buf)

# output the channel names
for channel in json.loads(buf)["channels"] :
    print channel["name"] , "=>" , type(channel["name"])



So, if you want to output UTF-8, you will need to remember to encode it e.g.

for channel in json.loads(buf)["channels"] :
    print channel["name"].encode("UTF-8")

Python 3

Python 3 really, really, really wants you to use Unicode strings (variables of type str). As explained above, Python's json library will convert strings to Unicode for you, so if you're using JSON to talk to Awasu, everything will have already been converted for you.
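As a rough illustration, the sketch below parses a canned response instead of calling Awasu. The "channels"/"name" layout follows the Python 2 example above, but the channel names and the byte string itself are made up for this example.

```python
import json

# Hypothetical stand-in for the bytes that urllib.request.urlopen(url).read()
# would return; the second channel name is two Japanese characters in UTF-8.
buf = b'{"channels": [{"name": "News"}, {"name": "\xe6\x97\xa5\xe6\x9c\xac"}]}'

# json.loads() accepts UTF-8 bytes directly (Python 3.6+) and returns str values.
names = [channel["name"] for channel in json.loads(buf)["channels"]]

for name in names:
    # ascii() keeps the output printable on consoles that can't show Unicode
    print(ascii(name), "=>", type(name).__name__)
```

Every name comes back as a str, so there is nothing left to decode.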

However, if you're receiving HTML or XML, things are a little more complicated. You will get back a series of raw bytes (in a bytes variable) and while yes, these are text documents, if you convert them to Unicode strings, subtle problems can arise, since they are not really one big string, but binary documents that contain textual content[1]. For these, you're probably better off storing them as byte data, and decoding as necessary when you extract values out of them.
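One way to do this, sketched here with a made-up HTML page and a deliberately simple regular expression (a real program would more likely use an HTML parser), is to search the bytes directly and decode only the value that was extracted:

```python
import re

# Hypothetical HTML document held as raw bytes, e.g. as read from a socket or file.
page = (
    b'<html><head><meta charset="UTF-8">'
    b"<title>Caf\xc3\xa9 news</title></head><body></body></html>"
)

# Search the bytes directly (note the bytes pattern), so the document
# itself is never decoded or re-encoded.
match = re.search(rb"<title>(.*?)</title>", page, re.DOTALL)

# Decode only the extracted value, using the encoding the page declares.
title = match.group(1).decode("utf-8") if match else None

print(ascii(title))
```

Because page is never converted, it can be written back out byte-for-byte, and its <meta> declaration remains true.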

Working with Awasu

When it comes to working with Awasu, things are fairly straightforward:

  • Awasu uses Unicode and UTF-8 for everything, including all API responses, generated HTML pages and reports, etc., whether read over the wire, or from a file.
  • If you need to output stuff[2], how you do it depends on what will be receiving that output, but you generally can't go wrong with UTF-8, in which case:
    • For HTML, include a <meta charset="UTF-8"> tag in the <head> section.
    • For XML, output <?xml version="1.0" encoding="UTF-8"?> as the first line.
    • JSON uses UTF-8 by default, so Unicode characters can be included as-is. Alternatively, they can be escaped e.g. "\u65E5\u672C" for "日本", which keeps the output pure ASCII.
    • For plain-text, some programs check if the first character of a document is a byte order mark, which is the special Unicode character 0xFEFF. This becomes 0xEF, 0xBB, 0xBF when encoded as UTF-8, so if you insert these 3 bytes at the start of your output, the reader will deduce that the document must have been encoded using UTF-8.
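The JSON and plain-text cases above can be sketched with Python 3's standard library. The variable names and sample values are only for illustration.

```python
import codecs
import json

text = "\u65e5\u672c"  # the two characters from the JSON example above

# json.dumps() escapes non-ASCII characters by default, giving pure ASCII ...
escaped = json.dumps({"name": text})

# ... or it can leave them as-is, for the caller to encode as UTF-8.
literal = json.dumps({"name": text}, ensure_ascii=False).encode("utf-8")

# The "utf-8-sig" codec prepends the 3-byte byte order mark when encoding.
plain = "hello".encode("utf-8-sig")

print(escaped)
print(plain[:3] == codecs.BOM_UTF8)
```

Both JSON forms parse back to the same value, so which one to use depends on what the receiving program copes with best.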

Fixed That For You

In closing, a quick word about ftfy, a Python module that is often used to handle character set and encoding issues.

To quote from its documentation (emphasis added):

The goal of ftfy is to take in bad Unicode and output good Unicode, for use in your Unicode-aware code. This is different from taking in non-Unicode and outputting Unicode, which is not a goal of ftfy. It also isn’t designed to protect you from having to write Unicode-aware code. ftfy helps those who help themselves.

The use of ftfy is warranted when you're receiving data from somewhere that has messed up character sets and/or encoding, and you don't have any way of fixing these problems, but you still need to accept their data. It shouldn't be used to get things working because you're messing up character sets and/or encoding 🙂
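To show the kind of damage in question, the sketch below deliberately mangles a string the way a misconfigured upstream system might, using only Python 3's standard library. ftfy itself is not used here, and the sample text is made up.

```python
# The text as it should look: "cafe" with an acute accent on the "e".
original = "caf\u00e9"

# A misbehaving sender encodes it as UTF-8, but then treats those bytes as Latin-1.
mangled = original.encode("utf-8").decode("latin-1")

# If you know exactly what went wrong, you can undo it by hand.
repaired = mangled.encode("latin-1").decode("utf-8")

print(ascii(mangled))        # 'caf\xc3\xa9'
print(repaired == original)  # True
```

The manual repair only works because the exact mistake is known. When it isn't, and you can't get the sender to fix it, that is the situation ftfy is meant for.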

1. For example, let's say you receive an HTML page that has a <meta> tag that declares it to be UTF-8. So, you decode it as UTF-8 and store it in a Unicode string variable. Later, you write it out to a file, encoded as UTF-16, but someone trying to read that file won't be able to, since the file is declaring that it has been encoded as UTF-8, but it was actually encoded using UTF-16.
2. For example, generate an HTML page or send an email.