WebScrape Plugin Channel
From AwasuWiki
This plugin scrapes a web page, using regular expressions to extract the information of interest, and generates a feed from the results.
Instructions
- Download and unpack this file: WebScrape-v1.30a.zip
The recommended location is the ChannelPlugins directory in Awasu's installation directory.
Two programs are provided:
- WebScrape.exe is the plugin itself. Start the Channel Wizard and browse to this file.
- WebScrapeSettings.exe is a utility that can be used to help set up and test the regular expressions needed to scrape a web page.
Full documentation is included in the zip file.
Contributed config files
Backwash.com
Contributed by squeg:
[ChannelParameters]
URL=http://www.backwash.com/index.php
Title=Backwash: Recent Columns
Description=Most recent columns posted to backwash.com
BaseUrl=
MaxItems=25
Shorthand=
SectionPattern=RECENT COLUMNS(.*?)more_columns.gif
ItemPattern-1=.*?<font color="#006699"><b>(?P<T>.*?)</b></font>
ItemPattern-2=.*?<a href="(?P<L>.*?)"
ItemPattern-3=.*?align="left" border="0"></a>(?P<D>.*?)</td>
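To illustrate how patterns like these might pick items out of a page, here is a minimal Python sketch. It is not the plugin's actual implementation. It assumes that SectionPattern's first group narrows the page to one region, that ItemPattern-1 to ItemPattern-3 are joined into a single expression, that `.` may match newlines, and that the named groups T, L and D stand for title, link and description. The function name `scrape_items` and the sample HTML are hypothetical.

```python
import re

# Patterns copied from the Backwash.com config.
SECTION = r'RECENT COLUMNS(.*?)more_columns.gif'
ITEMS = [
    r'.*?<font color="#006699"><b>(?P<T>.*?)</b></font>',
    r'.*?<a href="(?P<L>.*?)"',
    r'.*?align="left" border="0"></a>(?P<D>.*?)</td>',
]

# Hypothetical page fragment, shaped only to fit the patterns above.
SAMPLE_HTML = """<h2>RECENT COLUMNS</h2>
<table>
<tr><td><font color="#006699"><b>First Column</b></font>
<a href="/content.php?id=1"><img src="a.gif" align="left" border="0"></a>Summary one.</td></tr>
<tr><td><font color="#006699"><b>Second Column</b></font>
<a href="/content.php?id=2"><img src="b.gif" align="left" border="0"></a>Summary two.</td></tr>
</table>
<img src="more_columns.gif">
"""


def scrape_items(html, section_pattern, item_patterns, max_items=25):
    """Return a list of title/link/description dicts found in html."""
    # Assumption: group 1 of the section pattern is the region to search.
    if section_pattern:
        section = re.search(section_pattern, html, re.DOTALL)
        if section is None:
            return []
        html = section.group(1)
    # Assumption: the non-empty item patterns are concatenated in order.
    item_re = re.compile("".join(p for p in item_patterns if p), re.DOTALL)
    items = []
    for match in item_re.finditer(html):
        groups = match.groupdict()
        items.append({
            "title": groups.get("T"),
            "link": groups.get("L"),
            "description": groups.get("D"),
        })
        if len(items) >= max_items:
            break
    return items


if __name__ == "__main__":
    for item in scrape_items(SAMPLE_HTML, SECTION, ITEMS):
        print(item)
```

A sketch like this can be handy for trying a pattern against saved HTML, but the WebScrapeSettings.exe utility and the documentation in the zip file remain the reference for how the plugin really behaves.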
Dave Barry
Contributed by squeg:
[ChannelParameters]
URL=http://www.miami.com/mld/miamiherald/living/columnists/dave_barry/
Title=Dave Barry
Description=Dave Barry's most recent columns, hosted at the Miami Herald.
BaseUrl=http://www.miami.com/mld/miamiherald/living/columnists/dave_barry/
MaxItems=25
Shorthand=
SectionPattern=
ItemPattern-1=(.*?RECENT COLUMNS.*?\s*.*?\s*)?<a href="(?P<L>.*?)"\s*
ItemPattern-2=.*?class="digest-headline">(?P<T>.*?)</a>.*?
ItemPattern-3=\s*(?P<D>.*?)<br>
Futon Critic News
Contributed by jesse:
[ChannelParameters]
URL=http://www.thefutoncritic.com/cgi/newswire.cgi
BaseURL=http://www.thefutoncritic.com/cgi/
Title=Futon Critic News
Description=TV News extracted from Web page
MaxItems=15
Shorthand=
SectionPattern=<!-- ARTICLE CONTENT BEGINS HERE.*?-->(.*?)<!-- ARTICLE CONTENT ENDS HERE -->
ItemPattern-1=<p><span class="bigblue">(?P<D>.*?)</span>
ItemPattern-2=<br>.*?<a href="(?P<L>.*?)">(?P<T>.*?)</a>
ItemPattern-3=
Press Releases on Futon Critic Home Page
Contributed by jesse:
[ChannelParameters]
URL=http://www.thefutoncritic.com/cgi/home.cgi
BaseURL=http://www.thefutoncritic.com/cgi/
Title=Futon Critic Home
Description=TV PR extracted from Web page
MaxItems=15
Shorthand=
SectionPattern=highlights:.*?><br>(.*?)</div></td>
ItemPattern-1=.*?<a href="(?P<L>.*?)">(?P<T>.*?)>(?P<D>.*?)</a>
ItemPattern-2=
ItemPattern-3=
New Scientist (Latest News)
Contributed by squeg:
[ChannelParameters]
URL=http://www.newscientist.com/news/
Title=New Scientist
Description=Latest news from the <i>New Scientist</i> magazine.
BaseUrl=http://www.newscientist.com/news
MaxItems=30
Shorthand=
SectionPattern=
ItemPattern-1=.*?class="newslisthead"><a class="textlinks" href="(?P<L>.*?)"
ItemPattern-2=>(?P<T>.*?)</a></b></td>.*?
ItemPattern-3=\s*<p>(?P<D>.*?)</p>.*?
Youth Specialties (New Articles)
Contributed by jesse:
[ChannelParameters]
URL=http://youthspecialties.com/articles/
Title=Youth Specialties Articles
Description=New And Featured Articles on Youth Specialties
BaseUrl=http://youthspecialties.com
MaxItems=30
Shorthand=
SectionPattern=<!-- start cont.*?-->(.*?)<!-- end content -->
ItemPattern-1=<li>.*?<a href='(?P<L>.*?)'>(?P<T>.*?)</a>
ItemPattern-2=</b><br>(?P<D>.*?)<br><br></li>
ItemPattern-3=
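Before trying a contributed config in the plugin, it can help to check that the file parses and that its patterns compile. The following Python sketch does this with the standard `configparser` and `re` modules. It is only an approximation of how WebScrape reads its settings: it assumes the item patterns are joined in order, and the helper names `load_channel` and `check_patterns` are hypothetical. Note that `configparser` lowercases key names, which sidesteps the `BaseUrl`/`BaseURL` spelling difference seen in the configs above.

```python
import configparser
import re

# The Youth Specialties config, one key per line.
CONFIG = r"""
[ChannelParameters]
URL=http://youthspecialties.com/articles/
Title=Youth Specialties Articles
Description=New And Featured Articles on Youth Specialties
BaseUrl=http://youthspecialties.com
MaxItems=30
Shorthand=
SectionPattern=<!-- start cont.*?-->(.*?)<!-- end content -->
ItemPattern-1=<li>.*?<a href='(?P<L>.*?)'>(?P<T>.*?)</a>
ItemPattern-2=</b><br>(?P<D>.*?)<br><br></li>
ItemPattern-3=
"""


def load_channel(text):
    """Parse config text and return [ChannelParameters] as a dict."""
    # interpolation=None keeps '%' literal; only '=' separates key and value,
    # so ':' inside URLs and titles is left alone.
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(text)
    return dict(parser["ChannelParameters"])


def check_patterns(params):
    """Compile the patterns and return the sorted named groups in the item patterns."""
    if params.get("sectionpattern"):
        re.compile(params["sectionpattern"], re.DOTALL)
    # Assumption: ItemPattern-1..3 form one expression when concatenated.
    joined = "".join(params.get(f"itempattern-{n}", "") for n in (1, 2, 3))
    return sorted(re.compile(joined, re.DOTALL).groupindex)


if __name__ == "__main__":
    params = load_channel(CONFIG)
    print(params["title"], check_patterns(params))
```

If a pattern is malformed, `re.compile` raises `re.error`, which points to the offending expression before the config is ever loaded into Awasu.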
