Reading of malformed XML

Posted: Wed May 10, 2006 7:15 am
by Eddy
Hi,

Would it be possible to force Awasu to read a feed with malformed XML and be less strict about things?
This feed http://weblogs.sdn.sap.com/feeds/comments_rss.csp is rather problematic in Awasu. I've asked the owners to fix it, but their answer is that all the other readers can read it, so they won't be fixing it right away.
They are right about the other RSS readers. I've tested about 10 over time and none had problems with this feed. Only Awasu has problems with it.
Awasu is my preferred RSS reader, so I've ended up using two readers: one for the failing feed and Awasu for the rest. That's far from an ideal situation.

Eddy

Re: Reading of malformed XML

Posted: Wed May 10, 2006 8:27 am
by support
Eddy wrote:Would it be possible to force Awasu to read a feed with malformed XML and be less strict about things?

I'm sympathetic to this kind of thing and Awasu already tries to work around some of the more common errors.

However, working around this particular one is almost guaranteed to cause problems later. I'm surprised so many other readers are accepting it because it's most definitely wrong :-)

One way to work around it would be to write a plugin that fixed up errors like this and sent the corrected feed to Awasu, much like this one that translates encodings Awasu doesn't know about into one that it does. Any takers...? :-)

Re: Reading of malformed XML

Posted: Wed May 10, 2006 8:31 am
by support
Eddy wrote:I've asked the owners to fix it, but their answer is that all the other readers can read it, so they won't be fixing it right away.

BTW, if the publishers insist that their feed is OK, send them to feedvalidator.org. It flags the error :-)

Re: Reading of malformed XML

Posted: Wed May 10, 2006 8:42 am
by support
Eddy wrote:Would it be possible to force Awasu to read a feed with malformed XML and be less strict about things?

One of the things about working on a program for as long as I have with Awasu is that you sometimes forget about the really cool hidden features you wrote ages ago :oops:

With Awasu not running, find the channel's .CHANNEL file in your user's <tt>Channels</tt> sub-directory (e.g. <tt>C:\Program Files\Awasu\Users\YOUR-NAME\Channels</tt>) and open it up in Notepad. Find the line that says <tt>EncodingOverride=</tt> and change it to <tt>EncodingOverride=iso-8859-1</tt>. This forces Awasu to use the encoding the feed should be using, not the one it is actually using.
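
For anyone who would rather script the edit than use Notepad, here is a minimal sketch. It only assumes that the .CHANNEL file is plain text containing a line that starts with <tt>EncodingOverride=</tt>, as described above; the function name and the sample file contents (including the section name) are made up for illustration.

```python
def set_encoding_override(channel_text, encoding):
    """Return channel_text with its EncodingOverride= line set to `encoding`."""
    lines = channel_text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if line.strip().lower().startswith("encodingoverride="):
            # keep whatever line ending the file already uses
            ending = line[len(line.rstrip("\r\n")):]
            lines[i] = "EncodingOverride=" + encoding + ending
            return "".join(lines)
    raise ValueError("no EncodingOverride= line found")


# Hypothetical file contents: only the EncodingOverride= line comes from the thread.
sample = "[Channel]\r\nName=Comments\r\nEncodingOverride=\r\nOther=1\r\n"
print(set_encoding_override(sample, "iso-8859-1"))
```

To apply it to a real file, read the file's text, pass it through the function, and write the result back, with Awasu not running.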

That did the trick

Posted: Wed May 10, 2006 9:07 am
by Eddy
That is indeed working well. Many thanks.

Eddy

Bad luck

Posted: Wed May 10, 2006 11:42 am
by Eddy
Hi,

It seems that it doesn't fix all the problems:
10mei06 12:30:13 Comments error: Can't update the channel.
- XML parse failed (11:L46:C17): undefined entity
Feedvalidator indicates this is the problem:
<dc:creator>J&uuml;rgen Mayer</dc:creator>
Undefined named entity: uuml
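
The error is easy to reproduce outside Awasu. The sketch below uses Python's standard XML parser rather than Awasu's own, so it only illustrates the general rule: a named entity such as <tt>&uuml;</tt> is not defined in plain XML, while the numeric form parses fine. The sample item and the helper name are made up for the demonstration.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if xml_text is accepted by the standard XML parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A cut-down item in the style of the offending feed (illustrative only).
named = ('<item xmlns:dc="http://purl.org/dc/elements/1.1/">'
         '<dc:creator>J&uuml;rgen Mayer</dc:creator></item>')
# The same item with the entity written numerically (252 is the code point of "ü").
numeric = named.replace("&uuml;", "&#252;")

print("named entity parses:  ", parses(named))
print("numeric entity parses:", parses(numeric))
```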


Probably things like Hi\x85 will give problems too.
Eddy

Re: Bad luck

Posted: Wed May 10, 2006 12:39 pm
by support
Eddy wrote:Undefined named entity: uuml

Bugger. Awasu has a workaround in place for exactly this kind of thing but it doesn't work in certain situations (like this one :oops:).

Eddy wrote:Probably things like Hi\x85 will give problems too.

Actually, this one is fixed by the EncodingOverride fix described earlier.
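
A rough illustration of why an encoding override helps with a byte like <tt>\x85</tt>. It assumes, purely for the sake of the example, that the reader would otherwise treat the feed as UTF-8; the thread does not say which encoding the feed declares. The helper name is made up.

```python
def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


raw = b"Hi\x85"

# 0x85 on its own is not a valid UTF-8 sequence, so strict decoding fails.
print(repr(try_decode(raw, "utf-8")))
# iso-8859-1 assigns a character to every byte value, so decoding always succeeds.
print(repr(try_decode(raw, "iso-8859-1")))
# In Windows-1252, a close relative of iso-8859-1, 0x85 is the ellipsis character.
print(repr(try_decode(raw, "cp1252")))
```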

Can you fix that?

Posted: Wed May 10, 2006 1:00 pm
by Eddy
I would be very grateful indeed.

Re: Can you fix that?

Posted: Wed May 10, 2006 1:16 pm
by support
Eddy wrote:I would be very grateful indeed.

I'll put it on The List :whip:

In the meantime, this would be easily fixed by a plugin that converted all SGML entities into numeric entities.
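
As a sketch of what such a conversion could look like (this is not an existing Awasu plugin), the following uses the HTML 4 entity table from Python's standard library and leaves XML's five predefined entities alone. The function name is made up, and like any plain text substitution it would also rewrite entities inside CDATA sections.

```python
import re
from html.entities import name2codepoint

# These five are defined by XML itself and must not be rewritten.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text):
    """Replace known named entities (e.g. &uuml;) with numeric ones (e.g. &#252;)."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave predefined and unknown names untouched
        return "&#%d;" % name2codepoint[name]
    return _ENTITY_RE.sub(repl, text)


print(named_to_numeric("J&uuml;rgen &amp; S&oslash;ren"))
```

Entity names are case-sensitive, so <tt>&uuml;</tt> and <tt>&Uuml;</tt> map to different code points.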

Does that plugin already exist?

Posted: Wed May 10, 2006 1:25 pm
by Eddy
If so, where can I find it? I didn't see it in the available plugins.

Re: Does that plugin already exist?

Posted: Wed May 10, 2006 1:48 pm
by support
Eddy wrote:If so, where can I find it? I didn't see it in the available plugins.

That's because it hasn't been written yet! :-)

But, I've whipped up a quick hack (the operative word being hack - I have to go out in about 10 minutes :-)).

(*) You must have Python and the WIN32 extensions installed.

(*) Save this file somewhere as <tt>TranslateSgmlEntities.plugin</tt>:

[Config]
AuthorName=Awasu
AuthorEmailAddress=support@awasu.com
PluginNotes=Translates SGML entities into their equivalent numeric entities.

' ---------------------------------------------------------------------

[ChannelParameterDefinition-1]
Name=DownloadUrl
Type=string
DefaultValue=
Description=The URL of the feed to be translated.


(*) Save this file in the <u>same directory</u>, as <tt>TranslateSgmlEntities.py</tt>:

import sys
import win32api

# --- GLOBAL DATA -----------------------------------------------------

# maps each SGML entity to its Unicode code point
# (0x00FC and 0x00F8 are the lower-case letters; 0x00DC and 0x00D8 are the upper-case ones)
gSgmlEntities = {}
gSgmlEntities["&uuml;"] = 0x00FC
gSgmlEntities["&oslash;"] = 0x00F8

# --- MAIN ------------------------------------------------------------

def doMain( configFilename ) :

    # get the downloaded feed (its location is read from the config file)
    feedFilename = win32api.GetProfileVal( "System" , "DownloadUrlFile" , "" , configFilename )
    feedFile = open( feedFilename , "r" )
    feedBuf = feedFile.read()
    feedFile.close()

    # translate SGML entities
    # FIXME! We should restrict the search to content fields only i.e. title, description, etc.
    # FIXME! Should use re's to make this a bit more efficient!
    for sgmlEntity in gSgmlEntities :
        entityVal = gSgmlEntities[ sgmlEntity ]
        # a numeric entity must end with a semicolon, e.g. &#252;
        feedBuf = feedBuf.replace( sgmlEntity , "&#"+str(entityVal)+";" )

    # output the translated feed
    print( feedBuf )
   
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

# run the script
sys.tracebacklimit = 0
doMain( sys.argv[1] )


(*) Start the Channel Wizard and browse to the .PY file. Go to the next page and enter the feed URL.

(*) Awasu won't be able to parse the feed (because of the encoding problem) but continue on with the wizard to the end.

(*) Once the channel has updated and the new window appears (with nothing in it), exit Awasu. Find the channel's .CHANNEL file and apply the <tt>EncodingOverride=</tt> fix described earlier.

(*) Restart Awasu and update the channel.

NOTE: The offending SGML entities have been hard-coded in the Python script so you'll have to manually update the list at the top of the .PY file as the publisher introduces more illegal characters into their feed :roll:
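
To make that manual upkeep a little easier, a small helper along these lines could report which named entities in a downloaded feed are not yet in the table. This is only a sketch: the function name is made up, and it assumes the table uses keys of the form <tt>&name;</tt> like <tt>gSgmlEntities</tt> in the script above.

```python
import re

# Entities that XML defines itself; these never need translating.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def missing_entities(feed_text, known):
    """Return the sorted names of entities used in feed_text but absent from `known`."""
    found = set(re.findall(r"&([A-Za-z][A-Za-z0-9]*);", feed_text))
    known_names = {key.strip("&;") for key in known}
    return sorted(found - XML_PREDEFINED - known_names)


# Same shape as the table in the plugin script.
table = {"&uuml;": 0x00FC, "&oslash;": 0x00F8}
print(missing_entities("J&uuml;rgen &amp; Bj&ouml;rn, caf&eacute;", table))
```

Each name it reports needs a new entry in the table at the top of the .PY file.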

Let it never be said that there is something Awasu can't do! :clap:

Wonderful

Posted: Thu May 11, 2006 7:25 am
by Eddy
It works great.
It seems that EncodingOverride=iso-8859-1 isn't needed anymore or should I add it anyway?

Btw, I always get the 'Developers are naive' popup :lol:

Re: Wonderful

Posted: Thu May 11, 2006 10:10 am
by support
Eddy wrote:It works great.
It seems that EncodingOverride=iso-8859-1 isn't needed anymore or should I add it anyway?

It probably wouldn't be a bad idea. They haven't fixed their feed yet in this regard so it's probably only a matter of time before the same problem re-appears.

Eddy wrote:Btw, I always get the 'Developers are naive' popup :lol:

Um, what popup would that be...? :oops:

Popup

Posted: Thu May 11, 2006 10:22 am
by Eddy
Hi,

It happens each time I enter or do something in that channel. I've mailed a screendump to you.

Eddy

Re: Popup

Posted: Fri May 12, 2006 6:36 am
by support
Eddy wrote:It happens each time I enter or do something in that channel.

Is it still happening? I'm not seeing it and there's nothing in the feed that would explain it.

My best guess would be somebody embedded something in one of the items which has since dropped off the end of the feed. Awasu will be getting a "safe mode" soon to strip this kind of thing out.