Awasu » Calling the Awasu API using Python 3
Thursday 13th July 2017 7:40 PM

Let's try to run the same script as before, but using Python 3:

import urllib.request

# get the Awasu channel HTML
url = "http://localhost:2604/channels/get?name=encoding%20demo"
buf = urllib.request.urlopen( url ).read()
print( buf )

It sort of works, but now we're seeing lots of \r's and \n's, instead of the properly formatted HTML we got when using Python 2.

If we check the type of the buf variable, it's a bytes variable, whereas it was a str under Python 2. Python 3 is actually doing the Right Thing™ here - when you download something from a web server, it won't necessarily be text (e.g. it could be an image, or a ZIP file), so Python 3 returns a series of raw bytes, and it's up to you to decide how to interpret it. Python 2 also returns a series of bytes, but handing it to you in a string variable is playing a bit fast and loose.
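To see this without hitting the network, here is a small sketch. The raw value is a made-up stand-in for what urlopen( url ).read() returns, not real Awasu output:

```python
# A stand-in for the bytes returned by urlopen( url ).read() (made-up sample data).
raw = b"<html>\r\n<body>hello</body>\r\n</html>\r\n"

# Under Python 3 this is a bytes object, not a str.
print( type( raw ) )

# Printing bytes shows their representation, which is why the \r's and \n's are visible.
print( raw )

# Indexing a bytes object gives integers (byte values), not characters.
first_byte = raw[0]
print( first_byte )
```

The last line is a handy reminder that bytes are just numbers until we decide how to interpret them.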

So, to convert these raw bytes into a string[1], we need to decode it:

buf = urllib.request.urlopen( url ).read()
buf = buf.decode( "utf-8" )
print( buf )

We know to decode it as UTF-8, since Awasu always uses UTF-8 for everything.
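Choosing the right encoding matters, as the following sketch tries to show. The sample text is invented for the example, and is built from UTF-8 bytes here rather than downloaded:

```python
# Stand-in for downloaded bytes: a made-up Japanese title, encoded as UTF-8.
raw = "\u65e5\u672c\u8a9e\u306e\u30bf\u30a4\u30c8\u30eb".encode( "utf-8" )

# Decoding with the correct encoding gives back the original characters.
text = raw.decode( "utf-8" )
print( len( raw ), len( text ) )   # number of bytes vs. number of characters

# Decoding with the wrong encoding doesn't fail here, it just gives garbage.
wrong = raw.decode( "latin-1" )
print( wrong == text )

# Bytes that are not valid UTF-8 raise an error when decoded strictly.
try :
    b"\xff\xfe".decode( "utf-8" )
    decode_failed = False
except UnicodeDecodeError :
    decode_failed = True
print( decode_failed )
```

The silent garbage in the second case is the dangerous one, since nothing tells us that we guessed wrong.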

Extracting the item titles

We can extract and print out the item titles as before:

# extract the item titles
print( "<head>" )
print( "<meta charset=\"UTF-8\">" )
print( "</head>" )
print( "<ul>" )
for mo in re.finditer( r"""\s*(.*?)""" , buf ) :
    print( "<li> " + mo.group(1) )
print( "</ul>" )

But it doesn't quite work - the Japanese text is being displayed as question marks.

This is another case where it's not obvious that The Rule[2] applies. When we print something out, we are doing something with a string (sending it as a series of bytes to the console), and so we need to think about what encoding it will be in.
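One way question marks can appear is when text is encoded using a character set that has no room for the characters in question, with unencodable characters being replaced instead of raising an error. The sketch below imitates that using ASCII. Whether your console does exactly this is an assumption, since it depends on how the console and Python have been set up:

```python
import sys

# A made-up title containing some Japanese characters.
title = "Awasu \u65e5\u672c\u8a9e"

# The encoding Python thinks the console is using (may be None if stdout is not a console).
print( getattr( sys.stdout, "encoding", None ) )

# Encoding to a character set that can't represent the text, replacing what doesn't fit.
as_ascii = title.encode( "ascii", errors="replace" )
print( as_ascii )

# With the default (strict) error handling, the same conversion fails outright.
try :
    title.encode( "ascii" )
    encode_failed = False
except UnicodeEncodeError :
    encode_failed = True
print( encode_failed )
```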


We want to output text encoded using UTF-8, but if we do something like this:

print( b"<li> " + mo.group(1).encode("utf-8") )

Python 3 outputs the representation of the bytes value, not the value itself. Python 3's print() function converts whatever it is given to a Unicode string using str(), and for a bytes value, that gives its representation (b'...'), not its actual contents.
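We can see what print() is going to output by calling str() ourselves. This is just a sketch, and the title is made up for the example:

```python
# A made-up item title, written using escapes so the source stays plain ASCII.
title = "\u65e5\u672c\u8a9e"
line = b"<li> " + title.encode( "utf-8" )

# print() calls str() on its argument; for a bytes value, that is the same as repr().
as_text = str( line )
print( as_text )
print( as_text == repr( line ) )
```

The non-ASCII bytes come out as \x escapes, which is not what we want to send to the console.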

One way around this is to bypass Python 3's print() function and output raw bytes to the console directly e.g.

def print_utf8( val ) :
    sys.stdout.buffer.write( val.encode( "utf-8" ) )
    sys.stdout.buffer.write( b"\n" )

# extract the item titles
print_utf8( "<head>" )
print_utf8( "<meta charset=\"UTF-8\">" )
print_utf8( "</head>" )
print_utf8( "<ul>" )
for mo in re.finditer( r"""\s*(.*?)""" , buf ) :
    print_utf8( "<li> " + mo.group(1) )
print_utf8( "</ul>" )

However, this still doesn't work!

What's happening is that the console also needs to follow The Rule (it is doing something with a string: displaying it), so it needs to know how the string has been encoded (so that it can get the code points for each letter and display them). As with Notepad, IE and Firefox earlier, the console may take a guess as to what encoding to use, or it may have a setting, or it may do something else[3].

However, if we pipe the output to a file, and then examine the file, we can see that it has been written using UTF-8.
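The same check can be done from Python itself. The sketch below writes to a temporary file instead of piping the console output, and the write_utf8() helper and the titles.html file name are made up for the example:

```python
import os
import tempfile

def write_utf8( fp, val ) :
    """Write a str to a binary file object as UTF-8, followed by a newline."""
    fp.write( val.encode( "utf-8" ) )
    fp.write( b"\n" )

# Write a made-up title to a temporary file, as raw UTF-8 bytes.
path = os.path.join( tempfile.mkdtemp(), "titles.html" )
with open( path, "wb" ) as fp :
    write_utf8( fp, "<li> \u65e5\u672c\u8a9e" )

# Read the file back as raw bytes, and check that they decode cleanly as UTF-8.
with open( path, "rb" ) as fp :
    data = fp.read()
print( data )
print( data.decode( "utf-8" ) == "<li> \u65e5\u672c\u8a9e\n" )
```

Opening the file in binary mode ("wb" and "rb") matters here, since it stops Python from applying an encoding of its own choosing.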

1. Note that in practice, it may be better to leave it as raw bytes, since converting it to a Unicode string can cause problems.
2. Every time you do something with a string, you must know what character set is being used, and how the string was encoded.
3. Depending, of course, on the phase of the moon 😥