Random characters on the end of config files

Posted: Sat Feb 28, 2015 1:11 am
by martymarty
Hi! New user of Awasu here.

I've been experimenting with writing some channel plugins and have noticed that sometimes the data is garbled. The data has strange characters at the end, like %0A*2. Is this corrupted data?

Re: Random characters on the end of config files

Posted: Sat Feb 28, 2015 12:33 pm
by support
martymarty wrote:Hi! New user of Awasu here.

G'day and welcome aboard!

martymarty wrote:I've been experimenting with writing some channel plugins and have noticed that sometimes the data is garbled. The data has strange characters at the end, like %0A*2. Is this corrupted data?

Identifying whether content is HTML or plain-text has been a battleground throughout the entire history of RSS :-(

When Awasu reads content from a feed, that content is marked as either HTML, PLAIN-TEXT or UNKNOWN. RSS has no concept of this kind of thing, so content from an RSS feed will always be marked as UNKNOWN. RSS's replacement, Atom, does understand the difference, and so content from an Atom feed will be marked as HTML or PLAIN-TEXT.

When Awasu passes this content through to an extension, content that comes from a feed has its type appended to it. The %0A is actually a line-feed (since the INI files are percent-encoded), then the content type:
  • *0 = unknown
  • *1 = plain-text
  • *2 = html
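
As a rough sketch of how a plugin might separate the content from its type marker: this assumes the value has already been read from the INI file as a percent-encoded string, that it decodes as UTF-8, and that the marker (if present) is always the last line in the form *N. The function name split_content_type and the sample value are made up for illustration and are not part of Awasu's API.

```python
from urllib.parse import unquote

# Mapping taken from the list above.
CONTENT_TYPES = {"0": "unknown", "1": "plain-text", "2": "html"}


def split_content_type(raw):
    """Return (content, content_type) for a percent-encoded value.

    Assumes the type marker, if present, is a final line of the form "*N".
    Values without a recognisable marker are reported as "unknown".
    """
    decoded = unquote(raw)  # "%0A" becomes "\n" (UTF-8 is assumed)
    content, sep, marker = decoded.rpartition("\n")
    if sep and marker.startswith("*") and marker[1:] in CONTENT_TYPES:
        return content, CONTENT_TYPES[marker[1:]]
    # No marker found: hand back the decoded value untouched.
    return decoded, "unknown"


# Hypothetical value, as it might appear in a config file.
print(split_content_type("Hello%20world%0A*2"))
```

Only the last line is inspected, so line-feeds inside the content itself are left alone.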

Re: Random characters on the end of config files

Posted: Sat Feb 28, 2015 7:14 pm
by martymarty
Does that mean I can ignore them? What did you mean by battleground?

Re: Random characters on the end of config files

Posted: Sat Feb 28, 2015 7:59 pm
by support
martymarty wrote:Does that mean I can ignore them?

Most of the time, you can strip these off and ignore them, because most of the time, it makes no difference. The problem is, the few times when it does make a difference, it's impossible to figure out the right thing to do without this information.

Certain characters have a special meaning in HTML, e.g. &. It's used to write other special characters, e.g. you write &lt; if you want a < to appear. And if you want to write an &, you have to write &amp;

The problem is, when somebody writes something like "Ben & Jerry's", did they actually want an &, or was it part of a special HTML sequence? In this particular example, it's easy to tell that the author wanted an &, but say someone is writing an article about HTML (like this one :)), and they write "&amp;". Did they want an & to appear, or did they want those 5 characters to appear verbatim (as in "if you want an ampersand to appear, you must write &amp;")? There's no way to tell.

Atom (the successor to RSS) handles this by requiring that all content be declared as HTML or plain-text. If something is HTML, an & is assumed to start a special HTML sequence (as in &amp;); if it's plain-text, ampersands are interpreted as being ampersands. RSS doesn't have this concept, it just has "text", so every time Awasu sees an ampersand in RSS content, it has to guess what the author's intent was (and sometimes it guesses wrong, which is why you will occasionally see ampersands go missing in the content in Awasu).
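
To make the difference concrete, here is a small sketch of how the same string reads under each content type. It only deals with character entities (real HTML content would also contain tags), the function name to_display_text is made up for illustration, and leaving UNKNOWN content untouched is just one possible guess, not necessarily what Awasu does.

```python
import html


def to_display_text(content, content_type):
    """Return the text a reader should see, looking only at entities.

    content_type is one of "html", "plain-text" or "unknown".
    """
    if content_type == "html":
        # Entities are markup here: "&amp;" means a single "&".
        return html.unescape(content)
    if content_type == "plain-text":
        # Every character is literal, including "&".
        return content
    # Unknown: whatever we do is a guess. This sketch leaves the text alone.
    return content


sample = "To show an ampersand, write &amp;"
print(to_display_text(sample, "html"))        # To show an ampersand, write &
print(to_display_text(sample, "plain-text"))  # To show an ampersand, write &amp;
```

The two results differ for identical input, which is why the type marker matters in the few cases where it matters at all.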

The TL;DR is you should really consider whether content is HTML or plain-text, but most of the time, you can get away with not worrying about it.

martymarty wrote:What did you mean by battleground?

Ah ha :-)

RSS was invented by a guy called Dave Winer, who was, shall we say, a little lenient when it came to specifying exactly how RSS should work, thus giving rise to problems like the one I described above. He is also a little combative, and tended to rub people the wrong way, which didn't help. So a bunch of guys got together to devise a successor to RSS, called Atom. It has been accused of being complex and over-engineered, but by and large, it works well and fixes the problems RSS had.

If you're interested, you can find out more here and here. This is all ancient history, BTW...

Re: Random characters on the end of config files

Posted: Sat Feb 28, 2015 8:01 pm
by support
BTW, I've been doing a lot of work recently on some new plugins, and have written some wrapper libraries that hide a lot of this complexity. If you're working with Python, I can send you a pre-release copy to have a play with...

Re: Random characters on the end of config files

Posted: Mon Mar 02, 2015 8:55 am
by kevotheclone
This page compares Atom 1.0 and RSS 2.0
The Payload section discusses the different content types, although it may be a little out of sync with the final Atom specification's description of content types.