Calling the Awasu API using Python 2
Thursday 13th July 2017 7:39 PM

Subscribe to this feed in your Awasu, and you will see it has 2 items, one with an English title, and one in Japanese.

Getting the channel's HTML

This Python 2 code requests the channel's HTML page from Awasu[1], and prints it out:

import urllib

# get the Awasu channel HTML
url = "http://localhost:2604/channels/get?name=encoding%20demo"
buf = urllib.urlopen( url ).read()
print buf

Run it, and you will see a bunch of HTML printed out, so it seems to work. However, there's a subtle pitfall hidden in these seemingly innocuous few lines of code. When we read the response from a web server, we're reading a string (the page's HTML) and so, according to The Rule[2], we need to know how it was encoded. Figuring this out can be tricky, since the encoding can be specified in an HTTP response header, or embedded in the page itself[3].
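To illustrate the first case, here's a hypothetical helper (not part of the Awasu API; the function name and regex are my own) that pulls the charset parameter out of a Content-Type response header:

```python
import re

def charset_from_content_type( header ) :
    """Extract the charset parameter from a Content-Type header, if present."""
    mo = re.search( r"charset=([-\w]+)" , header , re.IGNORECASE )
    return mo.group(1) if mo else None

print( charset_from_content_type( "text/html; charset=UTF-8" ) )  # UTF-8
print( charset_from_content_type( "text/html" ) )  # None - we'd have to guess
```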

Fortunately, in this particular case, things are easy - we know we're talking to Awasu, and Awasu uses UTF-8 for everything - so we know that buf will contain a UTF-8-encoded string.
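As a quick sanity check (a minimal sketch, with the UTF-8 bytes for "日本語" hard-coded as an example), decoding the byte string gives us a proper Unicode string to work with:

```python
# the UTF-8 encoding of the Japanese item title "日本語"
raw = b"\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e"

# decode the bytes into a Unicode string
text = raw.decode( "utf-8" )

print( len(raw) )   # 9 bytes on the wire
print( len(text) )  # 3 characters once decoded
</antml>```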

Extracting the item titles

This bit of code extracts the item titles[4]:

import re

# extract the item titles
for mo in re.finditer( r"""<span class="itemTitle">\s*<a href=.*?>(.*?)</a>""" , buf ) :
    print mo.group(1)

When we run it, the output for "日本語" is garbled, but this is what the text looks like when it has been encoded using UTF-8. If you pipe the output to a file, and then open that file in Notepad, you will see the correct text[5].

It's now easy to generate a bit of HTML that lists the item titles:

# extract the item titles
print "<ul>"
for mo in re.finditer( r"""<span class="itemTitle">\s*<a href=.*?>(.*?)</a>""" , buf ) :
    print "<li>" , mo.group(1)
print "</ul>"

Save the output to a file, and open it in Internet Explorer.

Looks good.


But if we open it in Firefox, the Japanese text is garbled.

What's happened is that browsers, like Notepad, also need to follow The Rule, and since the HTML page hasn't declared what encoding it's using, IE has taken a guess and got it right, while Firefox insists on using an encoding it calls "Western" (also known as ISO-8859-1).
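We can reproduce Firefox's mistake in a couple of lines, by decoding UTF-8 bytes as if they were ISO-8859-1 (a sketch, again using hard-coded bytes for "日本語"):

```python
raw = b"\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e"  # "日本語" encoded as UTF-8

# decode the right way, then the way Firefox guessed
correct = raw.decode( "utf-8" )
garbled = raw.decode( "iso-8859-1" )

print( len(correct) )  # 3 characters
print( len(garbled) )  # 9 characters - every byte became its own Latin-1 character
```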

If you manually force the page to be interpreted using the "Unicode" encoding (i.e. UTF-8), the Japanese text appears correctly.


So, all we need to do is add a <meta> tag to the HTML that declares the page is encoded using UTF-8, and the page will render correctly in any browser:

# extract the item titles and generate the HTML
print "<head>"
print "<meta charset='UTF-8'>"
print "</head>"
print "<ul>"
for mo in re.finditer( r"""<span class="itemTitle">\s*<a href=.*?>(.*?)</a>""" , buf ) :
    print "<li>" , mo.group(1)
print "</ul>"
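If you'd rather write the HTML straight to a file than pipe stdout, io.open lets you specify the encoding explicitly, so there's no ambiguity about what ends up on disk (a sketch; the filename and HTML snippet are arbitrary):

```python
import io

# the generated HTML, as a Unicode string (u"\u65e5\u672c\u8a9e" is "日本語")
html = u"<head><meta charset='UTF-8'></head>\n<ul>\n<li> \u65e5\u672c\u8a9e\n</ul>\n"

# io.open encodes the Unicode string as it writes it out
with io.open( "items.html" , "w" , encoding="utf-8" ) as fp :
    fp.write( html )
```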

1. I'm assuming that no API token has been configured.
2. Every time you do something with a string, you must know what character set is being used, and how the string was encoded.
3. Of course, if it's embedded in the HTML (in a <meta> tag), you need to know the encoding beforehand, so that you can parse the HTML, in order to extract the encoding :wall:
4. Regular expressions are not the best way of parsing HTML, but they better highlight the issues under discussion.
5. Notepad also has to follow The Rule, but since it has no way of knowing what encoding was used when the file was saved, it has to guess; in this particular case, it guesses correctly.