Automatically downloading URLs found in a feed
Hi,

I just downloaded Awasu Professional and paid the $95 to unlock the password-protected feed feature. I have a very specific usage need that I'll try to describe quickly:

1) I log into this feed:

http://fod.xmlteam.com/api/getListings? ... etwork.com

(This feed won't work without my username and password. When I enter them, though, the list comes up fine in Awasu.)

2) The feed gives me a list of links to Football (American) Box Score files.

3) What I would like to do is have each link 'saved as' to a specific folder. That's it.

Since the links are not direct links to a specific file with a specific extension (in Chrome they download as "getDocument.xml", and I'm certain they're just auto-generated by XMLTeam's database), I've written a program that renames them using their internal file info. Basically, I don't care in the least what the files are called when Awasu downloads them: I have a file-rename routine that renames and moves each file, and then it's shredded by my XML parser into a SQL database. I just didn't want to write the code to manage the automatic downloads, and bought your program in the hope of saving time. Is there any way I can achieve this? I hope so! Please let me know.


Sun Jan 01, 2012 3:37 am
Site Admin

Joined: Fri Feb 07, 2003 8:48 am
Posts: 2906
Location: Melbourne, Australia
Re: Automatically downloading URLs found in a feed
bill.b wrote:
What I would like to do is have each link 'saved as' to a specific folder. That's it.

You might have to write a bit of code, but not much.

If the links you want are also the feed item links, you might be able to use Awasu's "download feed items for reading offline" feature (Channel Properties dialog, Advanced tab). Awasu will download the linked-to items and save them in its database. If you open a channel window for a channel that has had its feed items downloaded in this way, you'll see that each item has a special clickable icon - you can then wget the URL behind that link to get the feed item content. More info here. This will only work if the publisher has set up their feeds in a way that enables it (and they probably won't have :roll:).

Probably easier, anyway, would be to use Awasu's channel hooks. You can write a script that gets called every time a new feed item arrives, so your script would identify the URLs you're interested in and download them. Downloading a file is a single line of code in Python, or you could just call wget to do it for you. You could even get your script to then automatically push the files into your database :-) An example channel hook can be found in your Awasu installation directory, in $/ChannelHooks/LogChannelActivity.
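
For example, something like this (an untested sketch; the URL and target filename here are just placeholders, and in a real hook script the URL would come from the feed item):
Code:
import urllib

# the feed item URL your script has identified (placeholder value shown here)
itemUrl = "http://fod.xmlteam.com/api/getDocuments?doc-ids=SOME-DOC-ID"

# download it and save it to a local file
urllib.urlretrieve( itemUrl , "c:/temp/boxscore.xml" )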


Sun Jan 01, 2012 3:41 am

Joined: Mon Jan 02, 2012 10:50 pm
Posts: 4
Re: Automatically downloading URLs found in a feed
Thanks for this - it seems like the best route is to work up a plugin.

Our data provider has this to say about automatic downloads:
Code:
How to Automate Listing Requests

Because we're adhering to open standards here, you have many options:
Configure an RSS Reader for the automatic retrieval of content
   Conventionally used to read Blogs, there are dozens of RSS Readers on the market. Here are some popular lists of available readers:
      Google
      Blogspace
   Keep in mind that your RSS Reader should support the following features
      Scheduled RSS retrieval (most do this)
      Automatic retrieval of linked documents (Often touted for offline browsing)
      Standard http Authentication (for both the RSS file and the documents -- not all support the latter)
   If you want your software to automatically format and display your downloaded content, you should either:
      Employ software that watches a directory of retrieved files, and formats / processes new content as desired.
      Extend an Open Source RSS Reader to trigger document transformations, database loadings, etc.
   XML Team plans to release a FeedFetcher client application that can be readily configured for scheduled content retrieval and formatting.
      We'll even release technology that loads content into database tables.
   XML Team plans to support a FlexSport On Demand Users Group (affectionately called FODUG), where users can share their ideas and upload shareware


So essentially they're saying: use offline browsing and it'll work great. It seems like there's a bit more to it than that, though.

With the feed loaded I get this error when trying to download each of the files for offline browsing:

"Can't Download to MHTML: Access is denied."

Is this the error I should be expecting? Thanks again,

Bill


Mon Jan 02, 2012 11:13 pm

Joined: Mon Jan 02, 2012 10:50 pm
Posts: 4
Re: Automatically downloading URLs found in a feed
I've been reading over the documentation, and I now have a C# executable running as a channel plugin. I rewrote the sample Python script - mainly I just had to do a find/replace.

Code:
using System;
using System.Collections.Generic;
using System.Text;

namespace AwasuTestPlugin
{
    class Program
    {
        static void Main(string[] args)
        {
            string HOME_URL = @"http://www.test.com";
            int NFEEDITEMS = 10;

            // generate the RSS feed
            Console.WriteLine(@"<rss>");
            Console.WriteLine(@"<channel>");

            Console.WriteLine(@"<title>Sample C# Channel</title>");
            Console.WriteLine(@"<link>{0}</link>", HOME_URL);
            Console.WriteLine(@"<description>This is a dummy RSS feed generated by the sample C# executable channel plugin.</description>");

            for (int i = 1; i <= NFEEDITEMS; i++)
            {
                Console.WriteLine(@"<item>");
                Console.WriteLine(@" <title>Item {0}</title>", i);
                Console.WriteLine(@" <link>{0}/item{1}.html</link>", HOME_URL, i);
                Console.WriteLine(@" <description>This is a dummy description for item {0}</description>", i);
                Console.WriteLine(@"</item>");
                Console.WriteLine(@"");
            }

            Console.WriteLine(@"</channel>");
            Console.WriteLine(@"</rss>");
        }
    }
}

After doing more research I found what I believe is the basic issue. I should have figured it out from the error log, but my brain failed, lol. Their feed requires HTTP authentication for both the feed and the individual files. Feed authentication with Awasu seems to work fine, but even with a custom script, would turning a link like this:

http://fod.xmlteam.com/api/getDocuments ... R34535-box

into

http://username:password@fod.xmlteam.co ... R34535-box

work from within Awasu? Or would it be possible to have Awasu re-authenticate each downloaded file automatically, which in theory would fix the issue?


Tue Jan 03, 2012 9:18 am
Site Admin

Joined: Fri Feb 07, 2003 8:48 am
Posts: 2906
Location: Melbourne, Australia
Re: Automatically downloading URLs found in a feed
BillBarnes wrote:
Or would it be possible to have Awasu re-authenticate each downloaded file automatically, which in theory would fix the issue?

From your previous post, it would appear that the publisher has indeed set things up in such a way that would allow Awasu's "download for reading offline" feature to do what you want.

The easiest way to do it would be to write an XSL transform that inserts your user ID/password into the item URLs, i.e. Awasu will download the feed XML, then apply your XSL, which changes all the item URLs to include the user ID/password. You can then instruct Awasu to download each feed item URL, which will be the XML file you want.

However, given that you have additional steps that need to be done, i.e. parsing the XML and storing the results in a database, you might be better off still using a channel hook script. Awasu will call this every time a new feed item arrives, and your script can modify the URL appropriately, download it, then do whatever processing you want. Your entire process will be completely automated :clap:


Tue Jan 03, 2012 2:31 pm

Joined: Mon Jan 02, 2012 10:50 pm
Posts: 4
Re: Automatically downloading URLs found in a feed
I've been reading up on XSL today but haven't been able to make a dent in changing the item URL. It has been interesting stuff to learn, and I have a new appreciation for XML.

My raw feed with one item looks like:

Code:
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:xts="http://xmlteam.com/xts" xmlns="http://purl.org/rss/1.0/" xmlns:sportsml="http://sportsml.com/1.0" version="1.0">
  <channel about="">
    <title>FlexSport On Demand: SportsML</title>
    <link>http://www.xmlteam.com</link>
    <description>
      The world's first Web Service for real-time sports data.
    </description>
    <dc:language>en-us</dc:language>
    <dc:creator>info@xmlteam.com</dc:creator>
    <dc:date/>
  </channel>
  <item about="http://fod.xmlteam.com/api/getDocuments?doc-ids=xt.14965140-box">
  <title>Dallas at NY Giants</title>
  <link>
  http://fod.xmlteam.com/api/getDocuments?doc-ids=xt.14965140-box
  </link>
  <description/>
  <dc:date>20120102T010637-0500</dc:date>
  <sportsml:sports-content-codes>
  <sportsml:sports-content-code code-type="publisher" code-key="sportsnetwork.com"/>
  <sportsml:sports-content-code code-type="priority" code-key="normal"/>
  <sportsml:sports-content-code code-type="sport" code-key="15003000"/>
  <sportsml:sports-content-code code-type="league" code-key="l.nfl.com"/>
  <sportsml:sports-content-code code-type="conference" code-key="c.nfc"/>
  <sportsml:sports-content-code code-type="team" code-key="l.nfl.com-t.19" code-name="New York Giants"/>
  <sportsml:sports-content-code code-type="team" code-key="l.nfl.com-t.18" code-name="Dallas Cowboys"/>
  <sportsml:sports-content-code code-type="revision-id" code-key="l.nfl.com-2011-e.3727-event-stats-sportsnetwork.com"/>
  </sportsml:sports-content-codes>
  </item>
</rdf:RDF>


All I need is a find/replace on the <link> tag, to change the value from http://fod.xmlteam.com/... to http://username:password@fod.xmlteam.com/... I just can't seem to get the namespace stuff right, or figure out how to access the nodes.

Thanks for any help,

Bill


Tue Jan 03, 2012 11:43 pm
Site Admin

Joined: Fri Feb 07, 2003 8:48 am
Posts: 2906
Location: Melbourne, Australia
Re: Automatically downloading URLs found in a feed
BillBarnes wrote:
It has been interesting stuff to learn, and I have a new appreciation for XML.

Interesting isn't the word I'd use for it - I think it's an abomination :mad:

Something like this will work...
Code:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rss="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

<!-- Matches the root node and starts the ball rolling -->
<xsl:template match="/">
  <xsl:apply-templates />
</xsl:template>

<!-- Matches everything not specified elsewhere and copies it to the output -->
<xsl:template match="node()|@*">
  <xsl:copy>
    <xsl:apply-templates select="@*" />
    <xsl:apply-templates />
  </xsl:copy>
</xsl:template> 

<!-- Matches RDF <link> elements -->
<xsl:template match="/rdf:RDF/rss:item/rss:link">
  <xsl:element name="rss:link">
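    <!-- NOTE: this prepends the user ID/password to everything after the original "http://" prefix; USERNAME and PASSWORD are placeholders -->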
    <xsl:value-of select="concat('http://USERNAME:PASSWORD@',substring-after(string(.),'http://'))" />
  </xsl:element>
</xsl:template>

</xsl:stylesheet>

Thing is, Awasu is not recognizing the URL properly and so the download feature won't work.


Wed Jan 04, 2012 4:44 am
Site Admin

Joined: Fri Feb 07, 2003 8:48 am
Posts: 2906
Location: Melbourne, Australia
Re: Automatically downloading URLs found in a feed
Here's a really simple channel hook (in Python) to get you started.

Create a file called, say, sports.hook somewhere:
Code:
[Config]
ScriptFilename=sports.py

[Events]
ConsolidatedNewFeedItems=1

In the same directory, create a file called sports.py:
Code:
import os
import sys
import string
import win32api

# initialize
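# (Awasu passes the path of an INI-style settings file describing the new feed items as the first command-line argument)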
configFilename = sys.argv[1]

# process each new feed item
feedItemNo = 1
while True :
    itemUrl = win32api.GetProfileVal( "NewFeedItem-"+str(feedItemNo) , "ItemUrl" , "" , configFilename )
    if itemUrl == "" : break
    newItemUrl = itemUrl.replace( "http://" , "http://USERNAME:PASSWORD@" )
   
    # NOTE: Download the XML file and do all your processing here - this sample just logs each URL
    fp = open( "c:/temp/sports.log" , "a" )
    fp.write( "Translated URL: " + newItemUrl + "\r\n" )
    fp.close()
   
    feedItemNo = feedItemNo + 1

Create a new channel and attach this new channel hook to it. Any time new feed items arrive, Awasu will call this script and you can do all your processing there.
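
If embedding the user ID/password in the URL gives you grief, you could also do the HTTP authentication in the script itself instead, e.g. something like this (an untested sketch using urllib2's basic-auth support; USERNAME, PASSWORD and the output path are placeholders, and itemUrl is the raw URL read in the loop above):
Code:
import urllib2

# set up an opener that will send the username/password (HTTP basic auth)
passwordMgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
passwordMgr.add_password( None , "http://fod.xmlteam.com/" , "USERNAME" , "PASSWORD" )
opener = urllib2.build_opener( urllib2.HTTPBasicAuthHandler(passwordMgr) )

# download the feed item and save it
response = opener.open( itemUrl )
fp = open( "c:/temp/boxscore.xml" , "wb" )
fp.write( response.read() )
fp.close()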


Wed Jan 04, 2012 4:59 am

Joined: Mon Sep 08, 2008 3:16 pm
Posts: 227
Location: Elk Grove, California
Re: Automatically downloading URLs found in a feed
American Football... Go Niners!

support wrote:
Interesting isn't the word I'd use for it - I think it's an abomination
I couldn't agree more. XSLT is a necessary evil, and "necessary" only when you absolutely must use it. :drevil:

I'm glad you figured this out; I was just taking a stab at it, modifying a copy of an XSLT that I previously wrote for URL rewriting.

:oops: Because I forgot to declare a prefix for the feed's default namespace (such as "rss"), my XPath expression of "rdf:RDF/item/link" wasn't matching anything. :wall: Looking at your example, I added the namespace declaration and updated my XPath to "/rdf:RDF/rss:item/rss:link", and it started working.

FWIW here's what I came up with (with your help of course):

Code:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rss="http://purl.org/rss/1.0/">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
  <xsl:variable name="UserId">MyUserID</xsl:variable>
  <xsl:variable name="Password">MyPassword</xsl:variable>
  <!--
  Matches everything not specified elsewhere; copies it to the output.
  -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>
  <!--
  Matches the root node and "gets the party started".
  -->
  <xsl:template match="/">
    <xsl:apply-templates/>
  </xsl:template>
  <!--
  Matches RSS 1.x <link> elements.
  -->
  <xsl:template match="/rdf:RDF/rss:item/rss:link">
    <xsl:element name="{name(.)}">
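      <!-- NOTE: substring(., 11) assumes the <link> text starts with a newline and two spaces of
           indentation before "http://" (10 characters in all); substring-after(., 'http://')
           would be less fragile if the feed's whitespace ever changes -->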
      <xsl:value-of select="concat('http://', $UserId, ':', $Password, '@', substring(., 11))"/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>


Wed Jan 04, 2012 5:08 am

Joined: Mon Jan 02, 2012 10:50 pm
Posts: 4
Re: Automatically downloading URLs found in a feed
Hell yeah, Go Niners :)

Huge thanks to both of you for the work. Not shockingly, I still have an issue I'm trying to correct. I'm getting links returned like:

http://fod.xmlteam.com/http://USERNAME: ... 965140-box

from:

<xsl:value-of select="concat('http://USERNAME:PASSWORD@',substring-after(string(.),'http://'))" />

or

<xsl:value-of select="concat('http://', $UserId, ':', $Password, '@', substring(., 11))"/>

with an extra "http://fod.xmlteam.com/" at the beginning. Trying to debug it, I couldn't find a definition for string(.), which seems to be the stored value in Taka's example.

I haven't tried the Python script yet; I'll get Python installed now and check it out.

I really appreciate the help. Their tech people just emailed tonight and said people tend to use curl, which means the API docs are all out of whack. Good times!

Bill


Wed Jan 04, 2012 6:55 am
Site Admin

Joined: Fri Feb 07, 2003 8:48 am
Posts: 2906
Location: Melbourne, Australia
Re: Automatically downloading URLs found in a feed
BillBarnes wrote:
an extra "http://fod.xmlteam.com/" at the beginning.

Not quite sure why you're seeing this, but give the plugin a go. It's a far more flexible and powerful way of doing things, and it'll be the better solution in the long run.


Thu Jan 05, 2012 1:51 pm