aparry
Posts: 7
Joined: Tue Mar 25, 2014 6:48 pm
Location: Greenbelt, MD USA

Postby aparry » Mon Jan 18, 2016 6:48 pm

Hi Taka,

We currently use "Check&Get" by ActiveURLs to scrape websites for changes -- specifically, US Congressional committee websites, for changes/additions/cancellations to hearing and meeting schedules, including changes to times, dates, titles, locations, witnesses, etc. We also monitor many "think tank" schedules in the DC area the same way. The nice thing about this program is that it automatically emails us the website changes, with the page embedded in the email and all changes highlighted in yellow.

The problem we have now is that the pages are so dynamic that if their embedded Twitter feed updates, the program sends us a "page updated" email, when all we want are updates to the schedules/calendars. Or we just get an embedded webpage of gobbledygook.

We also tried creating RSS feeds for some of these pages using Feedity, since it lets you visually select the section you want updates on. But the problem with that is that it just sends us a link to the full webpage (no highlights), so we don't really know what changed, which is time-consuming.

Here are just two examples of sites that are no longer working in Check&Get:
Senate Armed Services Committee schedule -
http://www.armed-services.senate.gov/hearings

New America Foundation schedule -
https://www.newamerica.org/tags/events/

Since our non-techies will need to be able to add websites on the fly, we need something user-friendly for non-digit-heads (aka regular expressions can be difficult).

Thoughts?
Ann

aparry
Posts: 7
Joined: Tue Mar 25, 2014 6:48 pm
Location: Greenbelt, MD USA

Postby aparry » Mon Jan 18, 2016 8:00 pm

I thought I should add that we are currently monitoring about 100 sites. We'd like to do more if we can find a program that works well.

User avatar
support
Site Admin
Posts: 3021
Joined: Fri Feb 07, 2003 12:48 pm
Location: Melbourne, Australia
Contact:

Postby support » Tue Jan 19, 2016 8:38 am

aparry wrote:The problem we have now is that pages are so dynamic that if their embedded twitter feed updates, the program sends us a "page updated" email. When all we want is the update of the schedules/calendars.

As I'm sure you already know, monitoring web pages is an inherently messy business, and so there probably isn't any easy-to-use tool that will just automagically do What You Want. I could write some things that monitor your specific web sites, but I suspect that's not what you want.

Your best bet is probably to contact the Check & Get guys and ask them if it can be configured to ignore certain parts of the page, or just monitor a certain part of the page. It seems like a feature other people would be interested in...

User avatar
kevotheclone
Posts: 239
Joined: Mon Sep 08, 2008 7:16 pm
Location: Elk Grove, California

Postby kevotheclone » Wed Jan 20, 2016 7:54 am

It looks like Check & Get isn't supported anymore. I was going to suggest Copernic Tracker, which I used for many years, but it's no longer for sale and will be unsupported after April 30, 2016.

FWIW, Taka and I have previously discussed a Webscrape-like Awasu plugin that would use either XPath-like expressions or CSS selectors to extract the desired parts of an HTML page, using one or more Python HTML-parsing libraries.

XPath-like expressions or CSS selectors might be a little easier to use than regular expressions, but it would still be a messy business, and extremely fragile in the face of web page design changes.
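To illustrate the idea, here's a minimal sketch of the kind of extraction such a plugin might do, using Python's standard-library ElementTree and its limited XPath support. The page structure and class names below are invented for illustration; a real committee page would need its own expression, and would likely need a more forgiving HTML parser than ElementTree.

```python
# Sketch: pull only the schedule section out of a page, ignoring
# everything else (e.g. an embedded Twitter feed), using an
# XPath-like expression. Sample markup is hypothetical.
import xml.etree.ElementTree as ET

SAMPLE_PAGE = """<html><body>
<div class="twitter-feed"><p>latest tweet...</p></div>
<div class="hearings">
  <div class="hearing"><h3>Posture of the Department</h3><span>Jan 21, 10:00am</span></div>
  <div class="hearing"><h3>Nominations</h3><span>Jan 26, 9:30am</span></div>
</div>
</body></html>"""

def extract_hearings(xhtml):
    """Return (title, time) pairs from the schedule section only."""
    root = ET.fromstring(xhtml)
    items = []
    # Only elements matching this expression are considered; a change
    # to the Twitter feed above would not affect the result.
    for div in root.findall(".//div[@class='hearing']"):
        items.append((div.find("h3").text, div.find("span").text))
    return items

print(extract_hearings(SAMPLE_PAGE))
```

Monitoring just the extracted pairs (rather than the whole page) is what would suppress the false "page updated" alerts aparry describes.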

Unfortunately, I didn't get too far with this project before venturing off to something else (typical). :oops:

User avatar
support
Site Admin
Posts: 3021
Joined: Fri Feb 07, 2003 12:48 pm
Location: Melbourne, Australia
Contact:

Postby support » Fri Jan 22, 2016 3:10 am

kevotheclone wrote:FWIW, Taka and I have previously discussed a Webscrape-like Awasu plugin that would either use XPath-like expressions or CSS-selectors to extract the desired parts of an HTML page

Yah, I actually have a prototype of this running, where you can point-and-click at parts of a web page and it will generate an RSS feed from just those elements, but it's very rough and nowhere near ready for even an alpha release.

But, as I was researching other tools that do the same kind of thing, I found a bunch of programs that output CSV, XML, etc. So, if aparry can find such a program that she likes, it would be easy to write something to convert that output to RSS.
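As a rough sketch of what that converter might look like, the following turns CSV output into a minimal RSS 2.0 feed that a reader like Awasu could poll. The column names (title, link, description) are hypothetical; a real converter would map whatever columns the scraping tool actually emits.

```python
# Sketch: convert CSV rows (from some third-party scraping tool)
# into a minimal RSS 2.0 feed. Column names are assumptions.
import csv
import io
import xml.etree.ElementTree as ET

CSV_DATA = """title,link,description
Hearing on X,http://example.com/x,Rescheduled to 10am
Event Y,http://example.com/y,New witness added
"""

def csv_to_rss(csv_text, feed_title):
    """Build an RSS 2.0 document with one <item> per CSV row."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    for row in csv.DictReader(io.StringIO(csv_text)):
        item = ET.SubElement(channel, "item")
        for tag in ("title", "link", "description"):
            ET.SubElement(item, tag).text = row[tag]
    return ET.tostring(rss, encoding="unicode")

feed = csv_to_rss(CSV_DATA, "Committee schedule changes")
print(feed)
```

Run on a schedule (or from an Awasu channel plugin), this would give each monitored site its own feed, with one item per schedule entry.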

