Joined: Mon Mar 09, 2009 2:29 pm
Posts: 27
Post encode and bad charactors in RSS
I am getting the following warnings when validating an rss feed that was created by a channel report.

Code:
line 37, column 0: description contains bad characters (272 occurrences) [help]

\x95 Dell Dimension computers (for parts) $5.00 (each)

line 839, column 112: title should not contain HTML: & (26 occurrences) [help]

... int &amp;amp; pads)** (laguna beach) $55</title>
                                             ^

line 7813, column 0: style attribute contains potentially dangerous content: background-repeat (2 occurrences) [help]

&lt;tr>&lt;td valign=&quot;top&quot;>&lt;table bgcolor=&quot;#acc9d9&quot; b ...

line 8429, column 31: title contains bad characters (2 occurrences) [help]

<title>Moving Sale \x96 EVERYTHING GOES  (Laguna Niguel)</title>


I have tried the following with no success.

Code:
<description>{%ITEM-METADATA% description encode=sgml chars="<&\""}</description>

<description><CDATA> </description>

for the title
Code:
<title>{%ITEM-METADATA% name! encode=sgml chars="<&\""}</title>


I have also tried using both full content and plain text when setting up the channel report. Everything else validates except for what is posted above.

Any hints on how to solve this and what might be the reason?

Thanks

Jerry


Mon May 18, 2009 1:45 pm
Profile WWW
Site Admin
User avatar

Joined: Fri Feb 07, 2003 8:48 am
Posts: 2897
Location: Melbourne, Australia
Post Re: encode and bad charactors in RSS
The forum software has a tendency to mangle HTML/template code a bit so email me a copy of your report template and an example of a feed that doesn't validate.

You might want to take a look at the odd "\x95" things you seem to have in your feed, they look wrong :-)


Mon May 18, 2009 4:35 pm
Profile WWW
Site Admin
User avatar

Joined: Fri Feb 07, 2003 8:48 am
Posts: 2897
Location: Melbourne, Australia
Post Re: encode and bad charactors in RSS
Thanks for sending the stuff through. A couple of points from your email...

jerrymartin wrote:
The xml file does validate but with the errors I posted on the forum.

There's no such thing as XML that validates but contains a few errors :-) XML is very strict on this, either it's 100% valid or it's wrong. I suspect the errors you're referring to are coming from the Feed Validator. This doesn't validate XML (if the XML is invalid, it can't even start to do anything) but instead checks the contents of your feed for errors or other things that might cause problems. For example, it checks that all your items have associated URLs, that they have titles, etc.

jerrymartin wrote:
The strange characters seem to be <& ' and others. I have tried with and without cdata and both produce these same chars. I have to recheck but it seems that Awasu and IE do interpret these characters correctly but it looks ugly in the xml and is an error when validated.

Yes, these characters have special meaning in XML and therefore need to be handled carefully. XML is, unfortunately, inherently ugly :roll: and there's no getting away from having to escape these characters in your feed XML. However, XML is there for computers to read, not humans, so if it's ugly, it's not too important.

The "error" you're referring to is actually a warning, coming from the Feed Validator. Every error and warning it issues comes with a link with more information and this particular warning ("title should not contain HTML: &amp;") is well worth heeding.

RSS is notoriously deficient when it comes to specifying what should happen when special characters appear in item titles, since it doesn't say whether the title should be parsed as HTML or plain text. People have tried to come up with "best practices" and guidelines for how to deal with this issue but the bottom line is that since it's not explicitly stated what's going on, feed readers can do whatever they want and you, the publisher, have absolutely no control over what the end user sees. If you're generating your own feed, you can kinda sidestep the issue, by not using these special characters, but if you're publishing feeds created from content retrieved from elsewhere, you have to deal with it.
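
To make the ambiguity concrete, here is a minimal Python sketch (not anything Awasu does internally; the function names and the sample title are made up for illustration). It escapes a title once for XML, then shows what a reader would display if it treats the parsed title as plain text versus as HTML.

```python
from html import unescape as html_unescape
from xml.sax.saxutils import escape, unescape as xml_unescape


def as_feed_title(source_text: str) -> str:
    """Escape text once so it is safe inside an XML <title> element."""
    return escape(source_text)


def reader_views(title_xml: str) -> dict:
    """Return what a plain-text reader and an HTML reader would each show."""
    parsed = xml_unescape(title_xml)  # roughly what an XML parser hands over
    return {"plain_text": parsed, "html": html_unescape(parsed)}


# Hypothetical source title that already carries an HTML entity.
escaped = as_feed_title("paint &amp; pads")
views = reader_views(escaped)
print(escaped)
print(views)
```

The same feed XML ends up as two different on-screen titles depending on the reader, and RSS offers no way to say which one was intended.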

I tried to be Switzerland during the RSS/Atom format wars but people paying attention would have noticed that the feeds generated internally by Awasu were quietly converted over to Atom for the simple reason that it works, RSS doesn't (RSS works most of the time but in cases like this, you're screwed since there's no way to resolve the problem). There's a small learning curve with Atom since it makes the effort to specify things properly and fully (RSS is really easy to learn because it doesn't bother handling special cases like this) but it's not too hard to convert an Awasu template that generates RSS into one that generates Atom and it's worth the effort since your feeds will then work properly :roll: In the olden days, it was an issue that not every feed reader supported Atom but those days are long gone. Have a look at the Atom templates in the Awasu installation directory to see how they work.
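
As a rough illustration of what "specifying things properly" means here: Atom 1.0 text elements such as the title carry a type attribute ("text", "html" or "xhtml"), so the publisher states how the content should be read. The sketch below uses Python's standard ElementTree rather than an Awasu template, and the helper name atom_title is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"  # Atom 1.0 namespace


def atom_title(text: str, is_html: bool) -> str:
    """Build an Atom <title> that declares how its content should be read."""
    ET.register_namespace("", ATOM_NS)  # serialize without an ns0: prefix
    title = ET.Element(
        f"{{{ATOM_NS}}}title",
        {"type": "html" if is_html else "text"},
    )
    title.text = text  # ElementTree applies the XML escaping on output
    return ET.tostring(title, encoding="unicode")


# Plain text: the reader must show the ampersand literally.
print(atom_title("Paint & pads", is_html=False))
# HTML: the reader must run the content through an HTML parser.
print(atom_title("<b>Moving Sale</b> &amp; more", is_html=True))
```

In both cases the content is escaped exactly once for XML, and the type attribute tells the feed reader what to do after parsing.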

The warnings about bad characters in the feed (\x95 and \x96) are a similar kind of thing. If you are generating feeds created from content you've retrieved from elsewhere, and those other feeds contain bodgy characters, those bodgy characters will be re-published in your own feeds. Run the feeds you're retrieving through the Feed Validator and you will almost certainly get the same warnings. They typically end up in the feed because somebody's written the text in Microsoft Word or some other funky editor, then pasted it straight into their feed XML without realizing it might contain some gnarly characters (Microsoft's "smart quotes" are one of the leading culprits when it comes to invalidating feeds :roll: ).
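
For reference, \x95 and \x96 are the Windows-1252 bytes for a bullet and an en dash, and the "smart quotes" live in the same 0x80 to 0x9F range. A minimal Python sketch of one way to repair text where those bytes have leaked through as raw code points (the function name is made up, and this assumes the stray characters really did come from Windows-1252):

```python
def repair_c1_characters(text: str) -> str:
    """Map stray code points 0x80-0x9F to their Windows-1252 characters."""
    repaired = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                # Reinterpret the code point as a single Windows-1252 byte.
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                ch = ""  # a few bytes are undefined in Windows-1252; drop them
        repaired.append(ch)
    return "".join(repaired)


print(repair_c1_characters("Moving Sale \x96 EVERYTHING GOES"))
print(repair_c1_characters("\x95 Dell Dimension computers (for parts)"))
```

Text outside that range passes through untouched, so running this over clean content should be harmless.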


Tue May 19, 2009 7:49 am
Profile WWW
User avatar

Joined: Mon Sep 08, 2008 3:16 pm
Posts: 226
Location: Elk Grove, California
Post Re: encode and bad charactors in RSS
Maybe Jerry meant that his document is "well formed" but not "valid"; there's definitely a difference, and I know Taka knows the difference.

Quote:
The warnings about bad characters in the feed (\x95 and \x96) are a similar kind of thing. If you are generating feeds created from content you've retrieved from elsewhere, and those other feeds contain bodgy characters, those bodgy characters will be re-published in your own feeds.


Is there a way to preprocess someone else's feed and clean up those bodgy characters before it gets to Awasu?
One way would be to create a Channel Plugin that acts as a proxy between the real feed you are subscribed to and Awasu. That's exactly what the Change Feed Encoding Channel Plugin does, but it's only designed to change the character encoding of the feed. A modified version of the Change Feed Encoding Channel Plugin could find and replace bodgy characters before Awasu sees the feed. The source code of the Change Feed Encoding Channel Plugin is available in the ZIP file, just waiting for modification.
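
I haven't looked at how the plugin is actually written, so the following is only a sketch of the cleaning step such a proxy might perform, not the plugin's API. It assumes the raw feed bytes are either UTF-8 or Windows-1252, re-emits them as UTF-8, and rewrites the encoding named in the XML declaration to match. The function name to_utf8_feed is hypothetical.

```python
import re

# Matches the encoding value inside a leading <?xml ... ?> declaration.
_DECLARATION = re.compile(r'^(<\?xml[^>]*?encoding=["\'])[^"\']+(["\'])')


def to_utf8_feed(raw: bytes) -> bytes:
    """Return the feed re-encoded as UTF-8 (assumes UTF-8 or Windows-1252 input)."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: treat bytes such as 0x95/0x96 as Windows-1252.
        text = raw.decode("cp1252", errors="replace")
    # Keep the declaration honest about the new encoding.
    text = _DECLARATION.sub(r"\g<1>utf-8\g<2>", text, count=1)
    return text.encode("utf-8")


sample = (b'<?xml version="1.0" encoding="iso-8859-1"?>'
          b"<title>Moving Sale \x96 EVERYTHING GOES</title>")
print(to_utf8_feed(sample).decode("utf-8"))
```

Feeds in other encodings would need a smarter guess than this two-step fallback.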

Is there another way?
It looks like you can use XSLT to process a feed before it gets to Awasu. Even though the "Processing" dialog box lists SOAP as "Feed pre-processing" and XSLT as "Feed post-processing", the help file clearly states that "Channels can have one or more XSLT files associated with them that will be applied to the downloaded feed, modifying it in some way before it is processed by Awasu". So it appears that the feed data stored in Awasu will be the XSLT-modified content.

Does Awasu have any other tricks up its sleeve?
Maybe? It appears that Item Content Filters could easily find and replace the bodgy characters, although it sounds like this is only for display purposes in a Channel Summary Template; it won't actually affect the feed data stored in Awasu's database. So for Jerry, who is working on a Channel Report, this probably won't help.

Any other options?
You might be able to preprocess the feed before it gets to Awasu using Yahoo Pipes, a seemingly simple drag-and-drop, connect-the-modules, fill-in-the-blanks kind of (free) programming environment that can process feeds and other data sources. I haven't tried it yet but it's on my list of things to do. Of course you never know if Yahoo might decide to pull the plug on Pipes (there's gotta be a pun in there somewhere), so you really should continue to work on that Channel Plugin or XSLT solution, but Pipes might help you solve your problem quickly.

Coda
As usual, I think I've overstayed my welcome. So goodnight, and may all your Awasu dreams come true.


Wed May 20, 2009 12:20 am
Profile
Site Admin
User avatar

Joined: Fri Feb 07, 2003 8:48 am
Posts: 2897
Location: Melbourne, Australia
Post Re: encode and bad charactors in RSS
kevotheclone wrote:
I know Taka knows the difference.
Yah, he does but sometimes he gets sloppy and says "invalid" when he means "not well-formed" :oops: It's a good thing I didn't start lecturing on the importance of using the correct terminology (I was going to) :roll:

kevotheclone wrote:
Even though the "Processing" dialog box lists SOAP as "Feed pre-processing" and XSLT as "Feed post-processing", the help file clearly states that "Channels can have one or more XSLT files associated with them that will be applied to the downloaded feed, modifying it in some way before it is processed by Awasu". So it appears that the feed data stored in Awasu will be the XSLT-modified content.
Yes, that's correct. And I take your point about the "pre-" and "post-" labels being confusing. They reflect how these features are tacked onto Awasu's "download a feed" code: SOAP requests happen before the feed is downloaded, XSLT processing is done after.

kevotheclone wrote:
It appears that Item Content Filters could easily find and replace the bodgy characters, although it sounds like this is only for display purposes in a Channel Summary Template; it won't actually affect the feed data stored in Awasu's database. So for Jerry, who is working on a Channel Report, this probably won't help.
Also correct, and doubly-so since he is trying to generate some feed XML to re-publish on his site so this wouldn't work for his subscribers.

kevotheclone wrote:
You might be able to preprocess the feed before it gets to Awasu using Yahoo Pipes.
Wow, is that still going? I always meant to take a look at it and never got around to it...


Wed May 20, 2009 7:16 am
Profile WWW
User avatar

Joined: Mon Sep 08, 2008 3:16 pm
Posts: 226
Location: Elk Grove, California
Post Re: encode and bad charactors in RSS
Quote:
And I take your point about the "pre-" and "post-" labels being confusing.

It's not too confusing. Since your help file clearly explains when XSLT processing occurs and there's a help file link right in the "Processing" dialog box, I think anybody who delves this deep into Awasu's feed processing capabilities will read the help file and figure it out.

Quote:
Wow, is that still going? I always meant to take a look at it and never got around to it...

I finally gave it a try recently, just combining a couple of feeds together and it worked pretty well.

One funny thing regarding "...you never know if Yahoo might decide to pull the plug on Pipes...". Pipes has a pair of video tutorials hosted on Yahoo's "jumpcut" (beta) service. Clicking on the link to jumpcut reveals that Yahoo will be "closing the Jumpcut.com site on June 15, 2009"; so all of these "cloud computing" and "software as a service" concepts are nice, but there's no guarantee that they'll be around from one year to the next.


Fri May 22, 2009 4:26 pm
Profile

Joined: Mon Mar 09, 2009 2:29 pm
Posts: 27
Post 
Hmmm. I wasn't notified of replies (via email) so I haven't checked in for a couple of days so sorry for my delayed response.

I'm going to digest all the replies over the weekend as time allows and then decide which way to go with it. But here are a couple of responses to the easy stuff.

Quote:
There's no such thing as XML that validates but contains a few errors. XML is very strict on this, either it's 100% valid or it's wrong. I suspect the errors you're referring to are coming from the Feed Validator.


Maybe I should have said RSS instead of XML, and yes, I am talking about the feed validator accessible via Awasu. What I meant was that the feed validator said congrats, your feed is valid, but here are a few things you should fix. I figured it wasn't a terrible problem because the browsers and readers are reading it, but I just don't want someone trying to validate my feed and getting those warnings.

Quote:
I tried to be Switzerland during the RSS/Atom format wars but people paying attention would have noticed that the feeds generated internally by Awasu were quietly converted over to Atom for the simple reason that it works, RSS doesn't (RSS works most of the time but in cases like this, you're screwed since there's no way to resolve the problem). There's a small learning curve with Atom since it makes the effort to specify things properly and fully (RSS is really easy to learn because it doesn't bother handling special cases like this) but it's not too hard to convert an Awasu template that generates RSS into one that generates Atom and it's worth the effort since your feeds will then work properly In the olden days, it was an issue that not every feed reader supported Atom but those days are long gone. Have a look at the Atom templates in the Awasu installation directory to see how they work.

The warnings about bad characters in the feed (\x95 and \x96) are a similar kind of thing. If you are generating feeds created from content you've retrieved from elsewhere, and those other feeds contain bodgy characters, those bodgy characters will be re-published in your own feeds. Run the feeds you're retrieving through the Feed Validator and you will almost certainly get the same warnings. They typically end up in the feed because somebody's written the text in Microsoft Word or some other funky editor, then pasted it straight into their feed XML without realizing it might contain some gnarly characters (Microsoft's "smart quotes" are one of the leading culprits when it comes to invalidating feeds).


I'm going to reconsider using ATOM instead, but there was a reason that I converted an already (almost) ATOM channel report (metadata report) to RSS to begin with, though I'm not sure of all the reasons for that at the moment. I believe one was that I use FeedForAll to help me check and convert some feeds, and it seems to convert ATOM to RSS when you open an XML file. Another reason is that I am using a few parsers that I was afraid weren't going to work with ATOM. I'm sure there were a couple more too but I'm too tired to think right now.

It sounds like ATOM is preferred by Awasu and from what I have read it probably is the better or at least more flexible format.

So I have a lot to think about and digest before I figure out what is the best direction to take.

I generally prefer to see the crappy data coming in and then fix it, so at least I know what is going on. I like to see the raw data as long as there is a reasonable way to fix it on the front side.

So I'll report back what my thoughts are and how I'm going to proceed some time next week.

Appreciate all your help and comments, and I hope you guys have a great weekend.

Jerry


Fri May 22, 2009 6:46 pm
Profile WWW
Site Admin
User avatar

Joined: Fri Feb 07, 2003 8:48 am
Posts: 2897
Location: Melbourne, Australia
Post Re: encode and bad charactors in RSS
kevotheclone wrote:
Since your help file clearly explains when XSLT processing occurs and there's a help file link right in the "Processing" dialog box

The help link only works if you have already downloaded the CHM and generally speaking, if you need to look in the documentation to figure out how something works, there's something wrong with your UI and/or implementation.

kevotheclone wrote:
but there's no guarantee that they'll be around from one year to the next.

Yah, you have to feel sorry for all those people on GeoCities :hysterical:


Fri May 22, 2009 9:08 pm
Profile WWW
Site Admin
User avatar

Joined: Fri Feb 07, 2003 8:48 am
Posts: 2897
Location: Melbourne, Australia
Post 
jerrymartin wrote:
Another reason is that I am using a few parsers that I was afraid weren't going to work with ATOM.

Any parser that doesn't handle Atom by now is abandoned and not worth worrying about. You might find some that only support Atom 0.3 instead of 1.0, which is still something of an ongoing issue, unfortunately.
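
If you need to check which flavour a given feed (or a converter's output) actually is, the root element is enough to tell them apart: Atom 1.0 and Atom 0.3 use different XML namespaces, and RSS 2.0 has an un-namespaced rss root. A minimal Python sketch, with a made-up function name, that does not attempt to recognise RSS 1.0 (RDF):

```python
import xml.etree.ElementTree as ET

ATOM_10_NS = "http://www.w3.org/2005/Atom"  # Atom 1.0
ATOM_03_NS = "http://purl.org/atom/ns#"     # Atom 0.3


def detect_feed_format(xml_text: str) -> str:
    """Classify a feed by looking only at its root element."""
    root = ET.fromstring(xml_text)
    if root.tag == f"{{{ATOM_10_NS}}}feed":
        return "Atom 1.0"
    if root.tag == f"{{{ATOM_03_NS}}}feed":
        return "Atom 0.3"
    if root.tag == "rss":
        return "RSS " + root.get("version", "(unknown version)")
    return "unknown"


print(detect_feed_format('<feed xmlns="http://www.w3.org/2005/Atom"/>'))
print(detect_feed_format('<feed version="0.3" xmlns="http://purl.org/atom/ns#"/>'))
print(detect_feed_format('<rss version="2.0"><channel/></rss>'))
```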


Fri May 22, 2009 9:12 pm
Profile WWW
User avatar

Joined: Mon Sep 08, 2008 3:16 pm
Posts: 226
Location: Elk Grove, California
Post Re: encode and bad charactors in RSS
Quote:
...if you need to look in the documentation to figure out how something works, there's something wrong with your UI and/or implementation.


Ok, if you can think of a better way to label the UI components in the Processing dialog box, that'd be great. But I didn't really mean to imply that you needed to make any updates. With all of the options that you have packed into Awasu, I think a person may have to RTFM once in a while; I know I have. Especially with the "Processing" dialog box; user-definable SOAP and XSLT processing? Name one other product on the market that has these capabilities... I'm waiting... you can't... ok then. This is such a specialized, advanced, Awasu-only feature that if a user dares to delve into this area, I think they may need to RTFM.

Since we've veered off into this topic, does Awasu use MSXML as the XSLT processor and if not what XSLT processor does it use?
My mind is starting to think of things to do in the XSLT realm before Awasu gets the output. :where's my Dr. Evil emoticon:


Sat May 23, 2009 2:28 am
Profile
Site Admin
User avatar

Joined: Fri Feb 07, 2003 8:48 am
Posts: 2897
Location: Melbourne, Australia
Post Re: encode and bad charactors in RSS
kevotheclone wrote:
does Awasu use MSXML as the XSLT processor

Yes it does but it's v4.0 SP2. I started work upgrading Awasu to v6 but it got put on hold. It'd be nice to finish it since I think Vista comes with v6 already installed.

kevotheclone wrote:
My mind is starting to think of things to do in the XSLT realm before Awasu gets the output.

:yikes:


Sat May 23, 2009 6:10 am
Profile WWW

Joined: Mon Mar 09, 2009 2:29 pm
Posts: 27
Post 
Quote:
The help link only works if you have already downloaded the CHM and generally speaking, if you need to look in the documentation to figure out how something works, there's something wrong with your UI and/or implementation.


I remember when I used to write front ends to Client/Server apps for companies like Disney, Raytheon, etc. One of the first things we were often told is that there was no budget for a user manual so the GUI had to be idiot proof. These apps were often to be used by 50,000 users, and they would get minimal training. Being on the road for months at a time with other programmers/consultants, we used to go out to dinner every night. I would make sure to have quite a few drinks and be somewhat hung over in the morning, and if I could still understand my code and GUI in the morning then I was doing a pretty good job. Then for unit testing I would go and tie one on and be really hung over and do the test. If I could still use it without a manual it passed my unit test. :lol:

Joking aside, a good GUI doesn't need a manual, but when there are advanced features involved it's a different story. The way things are now, we advanced users (users, not inet coders) have to learn sooo many applications and sooo many languages that we barely have time to skim manuals and generally only refer to them when we really have to.

Examples Examples Examples

I can figure out just about anything if I have an example to look at. For example, I still haven't figured out how to use the Webscraper because none of the examples seem to work. I'm pretty sure the inis are outdated. I think Awasu's GUI is very good and has so many great features, and seemingly unlimited add-on potential. There are certainly some things that should be added or changed but for the most part it seems like a familiar environment.

You should be able to select a folder and update only those channels in the folder.

You should be able to cancel an update.

If you have a ton of channels set to do not auto update and then run the update all channels the disabled channels still get updated. There should be a way to update all, update disabled, update enabled, update by folder, and you should be able to highlight various channels and update selected. When you have 500+ channels these features are very important.

That's my 2cents for the day.

Jerry
http://www.iqnewsroom.com


Sat May 23, 2009 10:53 am
Profile WWW
Site Admin
User avatar

Joined: Fri Feb 07, 2003 8:48 am
Posts: 2897
Location: Melbourne, Australia
Post 
jerrymartin wrote:
I would make sure to have quite a few drinks and be somewhat hung over in the morning, and if I could still understand my code and GUI in the morning then I was doing a pretty good job.

Mate, we do things a bit differently in Australia. I write the code while I'm drunk, then if I can understand it in the morning, it passes :whistle:

jerrymartin wrote:
I'm pretty sure the inis are outdated.

Me too :roll: I fix them as people report them broken but I don't really have time to monitor them constantly. But as examples for how the regex's are supposed to work, they're probably sufficient, even if they're broken.

I've been wanting to write a new version of WebScrape for a while now, but again it's been a question of time. I find it tricky to use as well and the fact that we don't have the source code to it is also an issue :-(

jerrymartin wrote:
You should be able to select a folder and update only those channels in the folder.

A lot of people ask for more control over updating but I don't quite see why. You configure your channels to update automatically, then you just let Awasu do its job. If they're set to update every 5 minutes, then they will never be more than a few minutes out of date, and there are very few channels that update so often that this will be an issue (and if it is, RSS is probably the wrong tool to be using). If you have 500 channels, you absolutely don't want to be manually updating this and manually updating that, it's a waste of your time. I rather suspect it's more about people wanting to feel in control and absolutely-to-the-minute up-to-date rather than any real utility value. Updating 500 channels will typically take much longer than 5 minutes if you have new content coming in (and if you don't, the point is moot) so doing it manually is kinda pointless :roll:

jerrymartin wrote:
You should be able to cancel an update.

Why? What's the downside of just letting channels finish their update? It happens in the background, it doesn't really impact anything so why the need to cancel it?

jerrymartin wrote:
If you have a ton of channels set to do not auto update and then run the update all channels the disabled channels still get updated.

I've thought about this but here's the reasoning behind why things are the way they are. There's no such thing as a "disabled" channel, what you're referring to is a channel that has been configured to not update automatically at regular intervals. Hence, when you do an "update all channels", it updates since it's still "active", not "disabled". Now, I could add a switch that marks a channel as "disabled" and never updates but why on earth would someone want to do this? I know I sometimes stay subscribed to blogs that have been abandoned since I want to keep the archived content but updating the feed is harmless since it has no effect. Awasu is an information gathering (and processing) tool, so why would someone want to have a channel but not update it? The only reason I can think of why someone would want to be able to disable channels is if they wanted to update them only manually, but why would you want that? And more to the point, this is such a corner case, is it really worth adding more switches and flags to an already cluttered UI to support it?


Sat May 23, 2009 8:45 pm
Profile WWW
User avatar

Joined: Mon Sep 08, 2008 3:16 pm
Posts: 226
Location: Elk Grove, California
Post Re: encode and bad charactors in RSS
Since jerrymartin brought up updating options, I'll just throw out a couple of things that I had in the back of my mind, and Taka you don't even need to qualify them with a reply.

I'm using Awasu at home, where it does not run 100% of the time. I don't even have Awasu set to run when Windows starts (none of my apps start up automatically), so I manually start Awasu whenever I want a flood of information. Sometimes I'd like to tweak a parameter value on a Channel Hook and it would be nice if I could run Awasu with a "NoChannelUpdates" command line switch so that none of the Channels would update; I can make my tweak, shut down Awasu and then restart it manually in "normal" mode again. I could imagine a scenario where a "NoChannelReport" command line switch might be useful in the same vein. And of course the ability to combine both switches to get a "don't do anything automatically" mode would be great. I think I could perform the "update a parameter" Channel Hook task by hand-editing a .CHANNEL file before running Awasu, but I haven't actually tested this yet.

Not a deal breaker to me, just something I encountered once or twice a couple of months ago and thought this would be nice.


Sat May 23, 2009 9:27 pm
Profile

Joined: Mon Mar 09, 2009 2:29 pm
Posts: 27
Post 
Quote:
Mate, we do things a bit differently in Australia. I write the code while I'm drunk, then if I can understand it in the morning, it passes.


lol, that's pretty close to the same thing. You're drunk, I'm hung over, you're feeling good, I'm feeling miserable. Hmmm. I guess we are backwards. :wink:

Quote:
A lot of people ask for more control over updating but I don't quite see why. You configure your channels to update automatically, then you just let Awasu do its job. If they're set to update every 5 minutes, then they will never be more than a few minutes out of date, and there are very few channels that update so often that this will be an issue (and if it is, RSS is probably the wrong tool to be using). If you have 500 channels, you absolutely don't want to be manually updating this and manually updating that, it's a waste of your time. I rather suspect it's more about people wanting to feel in control and absolutely-to-the-minute up-to-date rather than any real utility value. Updating 500 channels will typically take much longer than 5 minutes if you have new content coming in (and if you don't, the point is moot) so doing it manually is kinda pointless.

Why? What's the downside of just letting channels finish their update? It happens in the background, it doesn't really impact anything so why the need to cancel it?


There are a few reasons; I'll touch on a few that come to mind.
...and keep in mind I'm not using Awasu just to read the news.

1. Let's say you have 500 feeds and a few are just taking forever to update. So you want to cancel and maybe turn those off until you figure out what the problem is.

2. If you're trying to troubleshoot #1 above because you don't know which ones are giving you the trouble, you might want to go through a few of them one by one to see what is up. You may want to do that by folder or by highlighting the ones you want to update.
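
Outside of Awasu, one rough way to find the slow feeds is to time each download separately and sort the results. This is only a sketch in Python, not an Awasu feature; the function name and the idea of passing in the fetch function are assumptions made so it can be tried without touching the network.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_updates(urls: Iterable[str],
                 fetch: Callable[[str], object]) -> List[Tuple[str, float]]:
    """Time fetch(url) for each URL and return (url, seconds), slowest first."""
    timings = []
    for url in urls:
        start = time.perf_counter()
        try:
            fetch(url)
        except Exception:
            pass  # a failing feed still gets timed, which is useful to know
        timings.append((url, time.perf_counter() - start))
    return sorted(timings, key=lambda item: item[1], reverse=True)
```

For real feeds, fetch could be something like lambda url: urllib.request.urlopen(url, timeout=10).read(), and the first few entries of the result are the channels worth switching off while testing.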

My problems may be unique because I'm not just reading the news, I'm sending emails, reports, xml, etc to my website.

My problem right now is simply that my update is taking way too long to finish, and I'm always wanting to cancel and turn off the channel I think might be causing it.

After all my testing is done maybe it won't be such an issue but it still would be a welcome enhancement.

One of the best reasons just happened...

As I am writing this my Awasu updates just kicked in, and because I have so many channels Awasu sucks up a lot of my CPU and sometimes I can't do anything else while it's running. So the running in the background can cause a problem, especially when you have lots of feeds. I plan to have a lot more than 500 so I'm going to be putting Awasu to the test.

I'll probably come up with some more later.

Later Mate, have a good one.

Jerry


Sat May 23, 2009 9:33 pm
Profile WWW