Joined: Mon Sep 08, 2008 3:16 pm
Posts: 226
Location: Elk Grove, California
Post Unencoding character entities
In my Channel Report output I'm getting characters encoded like this: &#8217; and &#8220;, as well as several others.

These appear no matter what "Feed Content" setting I use, even "Plain text".
I've also tried the encode=sgml and encode=percent settings, with no luck. Is there some other setting that I'm missing (or that is undocumented) that would convert these character entities back to their displayed characters?

&#8217; Apostrophe

&#8220; Left quote

&#8221; Right quote

&gt; >

&#8211; –

&amp; Ampersand


Wed Jun 08, 2011 8:15 pm
Site Admin

Joined: Fri Feb 07, 2003 8:48 am
Posts: 2897
Location: Melbourne, Australia
Post Re: Unencoding character entities
kevotheclone wrote:
I've also tried encode=sgml and encode=percent settings, with no luck

Can you post the entire template tag you're using? Where the encode= parameter is supported, Awasu also recognizes chars= to specify which characters need to be encoded. By default, these are:
(*) JSON: \ and "
(*) SGML: < and &
(*) PERCENT: %
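
To make the defaults above concrete, here is a rough Python model of an encoder driven by a chars= list. Only the three default character sets come from the list above; the function name `encode`, the `DEFAULT_CHARS` table, and the exact escaped forms it emits are assumptions for illustration and are not taken from Awasu.

```python
# Default character sets as listed above; everything else here is assumed.
DEFAULT_CHARS = {"json": '\\"', "sgml": "<&", "percent": "%"}

def encode(text, mode, chars=None):
    """Escape only the characters in `chars` (or the mode's defaults)."""
    chars = DEFAULT_CHARS[mode] if chars is None else chars
    out = []
    for ch in text:
        if ch not in chars:
            out.append(ch)                   # untouched character
        elif mode == "json":
            out.append("\\" + ch)            # backslash escape
        elif mode == "sgml":
            out.append("&#%d;" % ord(ch))    # numeric character reference
        else:
            out.append("%%%02X" % ord(ch))   # percent escape
    return "".join(out)
```

In this model, encoding only ever adds escaping: feeding it text that already contains a reference such as &#8217; escapes the ampersand again instead of turning the reference back into a character.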

The most likely explanation for what you're seeing is bodgy content :roll: There are more than a few publishers out there pushing out incorrectly encoded content and Awasu has to guess at what to do. Try running your report against a test feed file, where you know exactly what's in the content, and see what happens.


Thu Jun 09, 2011 1:57 am

Joined: Mon Sep 08, 2008 3:16 pm
Posts: 226
Location: Elk Grove, California
Post Re: Unencoding character entities
I should have mentioned that I used the chars= setting in conjunction with the encode= setting.

Here's one of the feeds that exhibits the encoding: http://www.awasu.com/weblog/?feed=atom Take a look at the blog post on Dutch Tilders for a couple of examples of &#8217; or &amp;#8217;, depending upon the "Feed content" setting. You can use Awasu's default Channel Report template to see the behavior; you may need to View Source in your browser to see the real encoding. You can also see the &#8217; characters in the original blog post (http://www.awasu.com/weblog/?p=788), so it's not just a bad feed.

My Channel Report is in an XML format (you know, the one I've been working on for a while now), so there's no web browser to convert &#8217; to a fancy apostrophe.

I do have some Python code to convert it for me, so I can handle it in a post-processing command, but I was just curious if Awasu had an undocumented decode= setting.

Code:
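import re

def _callback(match):
    # Convert the decimal code point captured from "&#NNNN;" to its character.
    code_point = match.group(1)
    try:
        return chr(int(code_point))
    except (ValueError, OverflowError):
        # Not a valid code point: leave the original text untouched.
        return match.group(0)

def decode_unicode_references(data):
    # Undo double encoding first, so "&amp;#8211;" becomes "&#8211;".
    data = data.replace("&amp;", "&")
    # Match "&#NNNN;", or "&#NNNN" followed by whitespace (missing semicolon).
    return re.sub(r"&#(\d+)(;|(?=\s))", _callback, data)

def main():
    data = "U.S. &amp;#8211; Adviser&#8217;s Blunt &#8211; Memo on Iraq: Time &#8216;to Go Home&#8217;"
    print(decode_unicode_references(data))

if __name__ == '__main__':
    main()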
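
For comparison, the regex approach above handles decimal references only. A minimal sketch using Python 3's standard `html.unescape`, which also understands named entities and hexadecimal references, might look like this. The helper name `decode_entities` and its `passes` parameter are made up for illustration, and two passes assumes the content is at most double-encoded, as in the &amp;#8211; example.

```python
import html

def decode_entities(data, passes=2):
    """Decode character references, repeating to undo double encoding.

    `passes` is a hypothetical knob: 2 covers text such as "&amp;#8211;",
    which needs one pass to reach "&#8211;" and a second to reach the dash.
    """
    for _ in range(passes):
        decoded = html.unescape(data)
        if decoded == data:
            break  # nothing left to decode
        data = decoded
    return data

sample = "U.S. &amp;#8211; Adviser&#8217;s Blunt &#8211; Memo"
decoded_sample = decode_entities(sample)
```

One trade-off to keep in mind: a second pass will also decode text that was deliberately written as &amp;lt;, so `passes=1` is the safer choice when the input is known to be encoded only once.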


Thu Jun 09, 2011 2:48 am
Site Admin

Joined: Fri Feb 07, 2003 8:48 am
Posts: 2897
Location: Melbourne, Australia
Post Re: Unencoding character entities
kevotheclone wrote:
These appear no matter what "Feed Content" setting I use, even "Plain text".

It works correctly for me if you use Full content - maybe you had some leftover encode=... settings that were breaking things...?

This is what's happening (are you sitting comfortably?)...

Awasu tags every piece of text it records as being PLAIN TEXT, HTML, or UNKNOWN. The feed you gave as an example uses Atom so each piece of text can definitively be identified as TEXT or HTML (yay!). UNKNOWN is used for RSS content.

In the article about the Dutchman, the feed XML looks something like this:
Code:
<content type="html"> <![CDATA[ ... They&#8217;re ... ]]> </content>

so the content gets recorded as HTML, with the 7 characters that make up the encoded character, i.e. it doesn't get decoded down to a single character. This is correct behavior.
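
This is easy to confirm outside Awasu with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a cut-down, made-up entry modelled on the snippet above; the `FEED` string and the `read_content` helper are illustrative only.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# A cut-down, made-up entry modelled on the snippet above.
FEED = """<entry xmlns="http://www.w3.org/2005/Atom">
  <content type="html"><![CDATA[<p>They&#8217;re here</p>]]></content>
</entry>"""

def read_content(xml_text):
    """Return (type attribute, raw text) of the entry's <content> element."""
    content = ET.fromstring(xml_text).find(ATOM_NS + "content")
    return content.get("type"), content.text

if __name__ == "__main__":
    content_type, text = read_content(FEED)
    print(content_type)        # the declared type of the payload
    print("&#8217;" in text)   # True: the reference survives XML parsing
    print(len("&#8217;"))      # 7
```

Because the payload sits inside a CDATA section, the XML parser hands it over untouched, so the reference is still seven separate characters when the application receives it.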

When inserting the content into a report, Awasu must consider the type of data (TEXT/HTML/UNKNOWN) being inserted, and the output format (HTML/XML/etc.) and encode accordingly. If we are using Awasu's default report, set to use Full content, Awasu will see the data type as being HTML, and since the output format is HTML, no encoding is necessary and the content gets inserted verbatim. The browser will then convert the 7-character sequence into the correct character when it is rendering the page.

However, if you are using Excerpt, Awasu has to create an excerpt of the content and a side-effect of this process is that the content is always set to type TEXT (there are good reasons for doing this). So, when it comes time to insert the content into the report, data of type TEXT must be encoded so that it will render correctly in an HTML page, which is why you're seeing the encoded string.
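
The two cases can be modelled in a few lines of Python. This is a hypothetical sketch of the decision described above, not Awasu's code: the function name `insert_into_report`, the lowercase type labels, and the choice to escape everything that isn't HTML-into-HTML are all assumptions.

```python
import html

def insert_into_report(text, data_type, output_format="html"):
    """Hypothetical model of the encode-on-insert decision; not Awasu's code."""
    if data_type == "html" and output_format == "html":
        return text  # already markup: insert verbatim
    # Assumption: anything else is treated as plain text and escaped.
    return html.escape(text, quote=False)

full = insert_into_report("They&#8217;re", "html")     # Full content
excerpt = insert_into_report("They&#8217;re", "text")  # Excerpt
```

Here `full` stays as They&#8217;re, which a browser renders as a single apostrophe, while `excerpt` becomes They&amp;#8217;re, which a browser renders as the literal seven-character sequence.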

So... it seems that the underlying problem is that content gets set to type TEXT when it is excerpted - Awasu strips out HTML tags as part of this process but doesn't decode SGML entities. "Fixing" this is hairy and I'm not sure it wouldn't cause more problems than it solves... :roll:
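
As a rough illustration of that excerpting step, here is a Python sketch with entity decoding as an optional extra. Everything in it is assumed for the example: the name `make_excerpt`, the regex-based tag stripping, the `limit` of 80 characters, and the `decode_entities` flag; the real process is not described in this thread.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # naive tag matcher, good enough for a sketch

def make_excerpt(html_text, limit=80, decode_entities=False):
    """Hypothetical excerpt step: strip tags, optionally decode entities."""
    text = TAG_RE.sub("", html_text)
    if decode_entities:
        text = html.unescape(text)  # decode before truncating
    return text[:limit]

as_now = make_excerpt("<p>They&#8217;re <b>back</b></p>")
decoded = make_excerpt("<p>They&#8217;re <b>back</b></p>", decode_entities=True)
```

With the flag off, the result still contains &#8217; and is then escaped again as TEXT on insertion; with it on, the excerpt holds the real apostrophe. Whether decoding at this stage is safe for every feed is exactly the open question raised above, and this sketch doesn't settle it.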


Thu Jun 09, 2011 5:48 am

Joined: Mon Sep 08, 2008 3:16 pm
Posts: 226
Location: Elk Grove, California
Post Re: Unencoding character entities
support wrote:
"Fixing" this is hairy and I'm not sure it wouldn't cause more problems than it solves...

That's ok, I wasn't looking at this as a real bug that needed to be fixed, I was just looking to see if there was an undocumented feature that I could utilize.
Remember several months ago our discussions led to the (re)discovery of a previously undocumented {%UNDEFINE% paramName} statement; I was just hoping to find another hidden gem.

I'll write a simple Python exe that will decode these characters back to their single-character representations via STDIN and STDOUT, and I'll call it from my other post-processing command. If my decoder works well enough, maybe we'll add it to Awasu's wiki so anyone else, with a need, can use it too. :drevil:
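
A minimal sketch of what such a filter could look like in Python 3 is below. It assumes UTF-8 text and a single level of encoding (double-encoded input such as &amp;#8217; would need a second pass), and the name `decode_stream` is made up for the example.

```python
import html
import io

def decode_stream(infile, outfile):
    """Copy infile to outfile, decoding character references line by line."""
    for line in infile:
        outfile.write(html.unescape(line))

if __name__ == "__main__":
    # Demo with in-memory streams; a real filter would pass sys.stdin and
    # sys.stdout (ideally wrapped so both use UTF-8).
    src = io.StringIO("Adviser&#8217;s Blunt &#8211; Memo\n")
    dst = io.StringIO()
    decode_stream(src, dst)
    print(ascii(dst.getvalue()))
```

Keeping the decoding in a function that takes two file-like objects makes it easy to test with in-memory streams and to chain after another post-processing command.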


Thu Jun 09, 2011 4:04 pm