Joined: Wed Aug 20, 2008 4:52 pm
Posts: 3
Post channel report consisting of full bodies of news/articles
Hello, is it possible to make a report template that opens the whole item in the feed and saves it in the channel report e.g. removing all text with hyperlinks, removing images and other text with tags that I specify? I want to put the items from the feed into a single file so that it can be converted to speech by my TTS converter, so I need only the full text of the news saved in the file, without the links, tabs, images, etc. I could adapt the template for each feed to account for the differences. Or can I find a similar template somewhere? Is there any tool that could help editing report templates, some kind of editor similar to MS Frontpage for html or something like that? It would be useful if users could build templates in such an editor and preview how they work.


Tue Nov 11, 2008 4:43 am
Site Admin

Joined: Fri Feb 07, 2003 8:48 am
Posts: 2899
Location: Melbourne, Australia
Post Re: channel report consisting of full bodies of news/article
janbohaty wrote:
Hello, is it possible to make a report template that opens the whole item in the feed and saves it in the channel report e.g. removing all text with hyperlinks, removing images and other text with tags that I specify?

You can do this with any template, just configure the report to show feed content as plain text (underneath where you configure the template file).

janbohaty wrote:
Is there any tool that could help editing report templates, some kind of editor similar to MS Frontpage for html or something like that?

The way I do it is attach the new template to a channel and then edit the template in a normal text editor. Any time I want to see what things look like, I press F9 to refresh the channel summary.


Tue Nov 11, 2008 6:21 am

Joined: Wed Aug 20, 2008 4:52 pm
Posts: 3
Post 
OK, but most RSS feeds do not show the complete text of the item when I set plain text for the content style; they only display a short description or an extract. What I need is to display (or better, copy to the target file) the whole page as displayed in the browser when you click on the item. This way I want to transform the whole feed, all its items (an on-line newspaper, for example), into a single txt file containing continuous text, so that it can be TTS-ed into continuous speech.

So, is there any command I can include in the template that opens the link to the respective item completely and allows further processing of the content of the page? And can I, for example, remove all text with hyperlinks (usually these are not part of the main text of the article)?

So the first part of the task would be opening the article page (sometimes there are even more pages, but that's another issue), and the second would be extracting the relevant text (without hyperlinked text, images, recurring headers, etc.) into continuous text. The second task could be adapted for the particular feed (you could "teach" Awasu which parts of the page should be removed next time, as the pages usually keep the same design). This feature would also be useful for transforming news pages into simple text that can be sent by e-mail as plain text, not HTML.


Tue Nov 11, 2008 10:45 am
Site Admin

Joined: Fri Feb 07, 2003 8:48 am
Posts: 2899
Location: Melbourne, Australia
Post 
Sigh, it's never simple, is it?

What you're asking for is fairly complex and while Awasu doesn't have any built-in feature to do this kind of thing, if you can program a bit, there are a couple of options.

You could write a plugin channel that downloads the feed(s) you are interested in, downloads each linked-to item, strips the HTML and spits out the plain-text to a file somewhere.

Alternatively, Awasu channels can already be configured to download linked-to articles which are accessible via a special URL. So you could do that, then write something that periodically grabs the downloaded articles and converts them to plain-text.

Or you could use the MySQL channel hook to save downloaded items in a database, then periodically scan it for new articles to download and convert.
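
As a rough illustration of the first step of such a standalone script (not of Awasu's plugin channel API, which isn't shown in this thread), the item titles and links could be pulled out of an RSS 2.0 feed with Python 3's standard library. The function name list_feed_items and the sample feed are made up for this sketch; downloading each link and stripping the HTML would follow.

```python
import xml.etree.ElementTree as ET


def list_feed_items(rss_xml):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if link:  # items without a link cannot be downloaded, so skip them
            items.append((title, link))
    return items


# Hypothetical feed used only to exercise the function.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example</title>
<item><title>First story</title><link>http://example.com/1</link></item>
<item><title>Second story</title><link>http://example.com/2</link></item>
</channel></rss>"""

for title, link in list_feed_items(SAMPLE_FEED):
    print(title, "->", link)
```

This only covers plain RSS 2.0; Atom feeds use namespaced elements and would need different lookups.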


Tue Nov 11, 2008 11:52 am

Joined: Mon Sep 08, 2008 3:16 pm
Posts: 226
Location: Elk Grove, California
Post Re: channel report consisting of full bodies of news/article
Retrieval of the full text of a web page is something that I wanted in a feed reader, but I couldn't find anything on the market that supported this feature. Luckily, Awasu has those wonderful Channel Hooks and provides each item's URL, so we can retrieve the full text of a web page. Eventually I'll do some work in this area, but I'm too busy right now to fully develop this Channel Hook, so here's a simple example to get you started.

This example is triggered by the "NewFeedItem" event, so the Channel Hook will run for each item and save each item to a separate text file using the item's "title" as the file name. This isn't as efficient as the "ConsolidatedNewFeedItems" event, but I didn't have enough time to write the extra code required to download all of the articles and append them to a single text file (I'm still recovering from the flu).
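
If a single output file is what's wanted, one simple alternative that still uses the per-item "NewFeedItem" event is to open the same file in append mode each time the hook runs. A minimal Python 3 sketch; the name append_article and the "title, blank line, text" layout are assumptions made for illustration:

```python
def append_article(path, title, text):
    """Append one article to the combined text file, creating the file if needed."""
    # Mode "a" adds to the end of the file instead of overwriting it.
    with open(path, "a", encoding="utf-8") as out:
        out.write(title + "\n\n")
        out.write(text.strip() + "\n\n")
```

Each run of the hook would then call append_article with the same path, so the articles accumulate in arrival order.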

This example removes all HTML tags but does not "…remove all text with hyperlinks…" as your original request asked, so you may still have the old "Contact Us" link text at the bottom of the output.
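
For what it's worth, dropping the link text could be sketched as one more regular expression, run before the tags themselves are stripped. This is a Python 3 sketch with made-up names (LINK_ELEMENT, drop_links); it also removes links that sit inside the article's own sentences, which may or may not be what you want.

```python
import re

# (?is): ignore case, and let "." match newlines so multi-line links are caught.
# \b stops the pattern from matching other tags that start with "a", e.g. <abbr>.
LINK_ELEMENT = re.compile(r"(?is)<a\b[^>]*>.*?</a>")


def drop_links(html):
    """Delete every <a>...</a> element, including the text inside it."""
    return LINK_ELEMENT.sub("", html)
```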

@ janbohaty
If you have any HTML/programming/regular expression skills, you should be able to adapt this example for your purposes. You can either write regular expression patterns that match the text you want to discard and use Python's re.sub() function to replace the matched text with an empty string:
Code:
# (?is) = ignore case, and let "." match newlines so a multi-line <head> is removed.
html = re.sub(r'(?is)<head\b.*?</head>', '', html)

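…or you can write regular expression patterns that find the portion of the text that you want to keep, throwing away the rest of the text.
Code:
# (?s) lets "." match newlines. Note that the match stops at the first </div>,
# so content containing nested <div> elements will be cut short.
match = re.search(r'(?s)<div class="content">.+?</div>', html)
if match:
  strOut = match.group(0)
else:
  strOut = ""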


I've marked off a place in the code where you would add extra filtering criteria.
With all of the different ways that web pages can be constructed, determining what to keep and what to delete is the difficult part of extracting just the relevant text.
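
When the part to keep contains nested elements, a regular expression that stops at the first closing tag comes up short. A minimal sketch of an alternative, using the html.parser module from Python 3's standard library to count nesting depth; the class and function names are invented for this example, and "content" is just the class name used above:

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside the first <div> whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__(convert_charrefs=True)
        self.target_class = target_class
        self.depth = 0      # 0 = outside the target div; >0 = nesting level inside it
        self.done = False   # set once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            self.depth += 1  # a nested <div> inside the target
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_div_text(html, target_class="content"):
    parser = DivTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()
```

Because the parser decodes character references itself, the returned text needs no separate tag-stripping pass.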

And if you're still out there, janbohaty, what TTS software are you using? Can you pass input and output file paths to it via the command line to automate the TTS conversion process?

Enough rambling, here's the example Channel Hook:

In the Hook file make sure that only "NewFeedItem=1" is set; no other event should have a value of 1.
Code:
[Config]
ScriptFilename=GetFullText.py
DisplayName=Get Full Text
AuthorName=kevotheclone
AuthorEmailAddress=PM kevotheclone @ Awasu's Forums
Notes=This channel hook will retrieve the full text of the article referenced in the feed.

[Events]
NewFeedItem=1


Here's the GetFullText.py:
You'll want to change the output folder location.

Code:
import urllib2
import re
import win32api
import sys

# Constants
# The file location where the text files will be saved.
# Change this to whatever works for you, just remember to double the
# backslashes and make sure that it ends with double backslashes.
outputFolder = "C:\\"

# ---------------------------------------------------------------------
# Removes illegal Windows file name characters.
def cleanFileName(fileName):
  # Windows forbids \ / : * ? " < > | in file names (the original pattern missed "|").
  fileName = re.sub(r'[\\/:*?"<>|]', "_", fileName)
  return fileName

# ---------------------------------------------------------------------
# Removes some extra characters that Awasu appends to the title
def cleanTitle(name):
  # re.sub() returns an empty string unchanged, so no special case is needed.
  return re.sub(r"%0A\*\d", "", name)

# --- MAIN ------------------------------------------------------------

configFilename = sys.argv[1]

# Get the url of the page referenced by this feed item.
itemURL = win32api.GetProfileVal("NewFeedItem", "ItemUrl", "", configFilename)

# Get the feed item's title.
title = win32api.GetProfileVal("NewFeedItem", "ItemTitle", "", configFilename)
title = cleanTitle(title)
title = cleanFileName(title)

# Download the HTML page.
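# HTTPError and URLError live in the urllib2 module, so they must be qualified;
# HTTPError is caught first because it is a subclass of URLError.
try:
  response = urllib2.urlopen(itemURL)
except urllib2.HTTPError, e:
  print 'The server couldn\'t fulfill the request.'
  print 'Error code: ', e.code
except urllib2.URLError, e:
  print 'We failed to reach a server.'
  print 'Reason: ', e.reason
else:
  html = response.read()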

  # Remove the <head> element and all of its child elements
  # (this is never displayed in the body of the web page).
  # (?is) = ignore case, and let "." match newlines, since <head> usually spans many lines.
  html = re.sub(r'(?is)<head\b.*?</head>', '', html)

# ###############################################
# Add extra regular expression based filters here
# ###############################################


# ###############################################
# End of regular expression based filters section
# ###############################################

  # Remove all of the remaining HTML tags ((?s) lets "." match newlines inside a tag).
  html = re.sub(r'(?s)<.+?>', '', html)

# Replace any HTML "non-breaking space" entity references with a single space.
  html = html.replace("&nbsp;", " ")

  outFileName = outputFolder + title + ".txt"

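  # Write the text out. close() needs its parentheses; without them the call never happens.
  outFile = open(outFileName, "w")
  outFile.write(html)
  outFile.close()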


I'm not a good Python coder yet, so there are probably a lot of ways to improve this, but it's just a quick proof of concept and it does work.
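
One such improvement, sketched here for Python 3 rather than the Python 2 used above: the hook only converts "&nbsp;", but pages contain many other entities, and stripping tags tends to leave ragged whitespace behind. The helper name tidy_text is made up for this example; html.unescape is part of the standard library (Python 3.4 and later).

```python
import html
import re


def tidy_text(text):
    """Decode HTML entities and collapse the whitespace left behind by removed tags."""
    text = html.unescape(text)              # e.g. &amp; becomes &, &nbsp; becomes U+00A0
    text = text.replace("\u00a0", " ")      # treat non-breaking spaces as ordinary spaces
    text = re.sub(r"[ \t]+", " ", text)     # squeeze runs of spaces and tabs
    text = re.sub(r"\s*\n\s*", "\n", text)  # drop blank lines and whitespace around line breaks
    return text.strip()
```

It would be applied to the text after the tags have been removed and before the file is written.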


Tue Nov 18, 2008 5:08 pm