Difference between revisions of "WebScrape Plugin Channel"

From AwasuWiki
(Updated the config file for Futon Critic and added another)
Line 49:
  </pre>

- ===== Futon Critic =====
+ ===== Futon Critic News =====
  Contributed by jesse:
  <pre>
  [ChannelParameters]
- URL=http://www.thefutoncritic.com/cgi/gofuton.cgi?action=newswire&id=home
+ URL=http://www.thefutoncritic.com/cgi/newswire.cgi
  BaseURL=http://www.thefutoncritic.com/cgi/
  Title=Futon Critic News
Line 59:
  MaxItems=15
  Shorthand=
- SectionPattern=<td valign="top" bgcolor="#CCCCCC"><p>.*?-->(.*?)</p> </td>
+ SectionPattern==<!-- ARTICLE CONTENT BEGINS HERE.*?-->(.*?)<!-- ARTICLE CONTENT ENDS HERE -->
- ItemPattern-1=<p><span class='bigblue'>(?P<D>.*?)</span>
+ ItemPattern-1=<p><span class="bigblue">(?P<D>.*?)</span>
- ItemPattern-2=<br>.*?<a href='(?P<L>.*?)'>(?P<T>.*?)</a>
+ ItemPattern-2=<br>.*?<a href="(?P<L>.*?)">(?P<T>.*?)</a>
+ ItemPattern-3=
+ </pre>
+
+ ===== Press Releases on Futon Critic Home Page =====
+ Contributed by jesse:
+ <pre>
+ [ChannelParameters]
+ URL=http://www.thefutoncritic.com/cgi/home.cgi
+ BaseURL=http://www.thefutoncritic.com/cgi/
+ Title=Futon Critic Home
+ Description=TV PR extracted from Web page
+ MaxItems=15
+ Shorthand=
+ SectionPattern=highlights:.*?><br>(.*?)</div></td>
+ ItemPattern-1=.*?<a href="(?P<L>.*?)">(?P<T>.*?)>(?P<D>.*?)</a>
+ ItemPattern-2=
  ItemPattern-3=
  </pre>
Revision as of 01:09, 2 May 2006

This plugin scrapes a web page and uses regular expressions to extract the information of interest to generate a feed.
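As a rough illustration of that scrape-then-extract flow, here is a minimal Python sketch. It is not the plugin's own code and rests on several assumptions: that the first group of SectionPattern narrows the page to the region of interest, that ItemPattern-1 to ItemPattern-3 are concatenated into a single expression, that the named groups T, L and D carry an item's title, link and description, and that "." is allowed to match newlines. The patterns are taken from the Futon Critic News config on this page; the sample page text and the function name `scrape` are invented for the example.

```python
import re

# Hypothetical page fragment, invented purely for illustration.
PAGE = """
<html><body>
<!-- ARTICLE CONTENT BEGINS HERE -->
<p><span class="bigblue">First summary</span><br>
<a href="news.cgi?id=1">First headline</a>
<p><span class="bigblue">Second summary</span><br>
<a href="news.cgi?id=2">Second headline</a>
<!-- ARTICLE CONTENT ENDS HERE -->
</body></html>
"""

SECTION_PATTERN = (
    r"<!-- ARTICLE CONTENT BEGINS HERE.*?-->(.*?)"
    r"<!-- ARTICLE CONTENT ENDS HERE -->"
)
ITEM_PATTERNS = [
    r'<p><span class="bigblue">(?P<D>.*?)</span>',
    r'<br>.*?<a href="(?P<L>.*?)">(?P<T>.*?)</a>',
    r"",
]


def scrape(page, section_pattern, item_patterns, max_items=15):
    """Return a list of {title, link, description} dicts found in page."""
    # Assumption: an empty SectionPattern means "search the whole page".
    section = page
    if section_pattern:
        found = re.search(section_pattern, page, re.DOTALL)
        if found is None:
            return []
        section = found.group(1)

    # Assumption: the three item patterns are joined into one expression.
    item_re = re.compile("".join(item_patterns), re.DOTALL)

    items = []
    for match in item_re.finditer(section):
        groups = match.groupdict()
        items.append({
            "title": groups.get("T"),
            "link": groups.get("L"),
            "description": groups.get("D"),
        })
        if len(items) >= max_items:
            break
    return items


for item in scrape(PAGE, SECTION_PATTERN, ITEM_PATTERNS):
    print(item["title"], item["link"], item["description"])
```

Trying patterns against a saved copy of the page in this way can help when a site changes its markup, although WebScrapeSettings.exe remains the tool supplied for that job.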

Instructions

Unpack the zip file; the recommended location is the ChannelPlugins directory in Awasu's installation directory.


Two programs are provided:

  • WebScrape.exe is the plugin itself. Start the Channel Wizard and browse to this file.
  • WebScrapeSettings.exe is a utility that can be used to help set up and test the regular expressions needed to scrape a web page.

Full documentation is included in the zip file.


Contributed config files

Backwash.com

Contributed by squeg:

[ChannelParameters]
URL=http://www.backwash.com/index.php
Title=Backwash: Recent Columns
Description=Most recent columns posted to backwash.com
BaseUrl=
MaxItems=25
Shorthand=
SectionPattern=RECENT COLUMNS(.*?)more_columns.gif
ItemPattern-1=.*?<font color="#006699"><b>(?P<T>.*?)</b></font>
ItemPattern-2=.*?<a href="(?P<L>.*?)"
ItemPattern-3=.*?align="left" border="0"></a>(?P<D>.*?)</td>

Dave Barry

Contributed by squeg:

[ChannelParameters]
URL=http://www.miami.com/mld/miamiherald/living/columnists/dave_barry/
Title=Dave Barry
Description=Dave Barry's most recent columns, hosted at the Miami Herald.
BaseUrl=http://www.miami.com/mld/miamiherald/living/columnists/dave_barry/
MaxItems=25
Shorthand=
SectionPattern=
ItemPattern-1=(.*?RECENT COLUMNS.*?\s*.*?\s*)?<a href="(?P<L>.*?)"\s*
ItemPattern-2=.*?class="digest-headline">(?P<T>.*?)</a>.*?
ItemPattern-3=\s*(?P<D>.*?)<br>

Futon Critic News

Contributed by jesse:

[ChannelParameters]
URL=http://www.thefutoncritic.com/cgi/newswire.cgi
BaseURL=http://www.thefutoncritic.com/cgi/
Title=Futon Critic News
Description=TV News extracted from Web page
MaxItems=15
Shorthand=
SectionPattern=<!-- ARTICLE CONTENT BEGINS HERE.*?-->(.*?)<!-- ARTICLE CONTENT ENDS HERE -->
ItemPattern-1=<p><span class="bigblue">(?P<D>.*?)</span>
ItemPattern-2=<br>.*?<a href="(?P<L>.*?)">(?P<T>.*?)</a>
ItemPattern-3=

Press Releases on Futon Critic Home Page

Contributed by jesse:

[ChannelParameters]
URL=http://www.thefutoncritic.com/cgi/home.cgi
BaseURL=http://www.thefutoncritic.com/cgi/
Title=Futon Critic Home
Description=TV PR extracted from Web page
MaxItems=15
Shorthand=
SectionPattern=highlights:.*?><br>(.*?)</div></td>
ItemPattern-1=.*?<a href="(?P<L>.*?)">(?P<T>.*?)>(?P<D>.*?)</a>
ItemPattern-2=
ItemPattern-3=

New Scientist (Latest News)

Contributed by squeg:

[ChannelParameters]
URL=http://www.newscientist.com/news/
Title=New Scientist
Description=Latest news from the <i>New Scientist</i> magazine.
BaseUrl=http://www.newscientist.com/news
MaxItems=30
Shorthand=
SectionPattern=
ItemPattern-1=.*?class="newslisthead"><a class="textlinks" href="(?P<L>.*?)"
ItemPattern-2=>(?P<T>.*?)</a></b></td>.*?
ItemPattern-3=\s*<p>(?P<D>.*?)</p>.*?

Youth Specialties (New Articles)

Contributed by jesse:

[ChannelParameters]
URL=http://youthspecialties.com/articles/
Title=Youth Specialties Articles
Description=New And Featured Articles on Youth Specialties
BaseUrl=http://youthspecialties.com
MaxItems=30
Shorthand=
SectionPattern=<!-- start cont.*?-->(.*?)<!-- end content -->
ItemPattern-1=<li>.*?<a href='(?P<L>.*?)'>(?P<T>.*?)</a>
ItemPattern-2=</b><br>(?P<D>.*?)<br><br></li>
ItemPattern-3=

Fark.com

Contributed by Taka:

[ChannelParameters]
URL=http://www.fark.com
Title=Fark
Description=Rinse. Repeat. Wipe hands on pants.
BaseUrl=http://www.fark.com
MaxItems=15
Shorthand=
SectionPattern=
ItemPattern-1=<td align=right width="120"><a onMouseOver="w\('(?P<L>.*?)'\);
ItemPattern-2=.*?<td align=left>(?P<T>.*?)</td>
ItemPattern-3=(?P<D>)
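
The config files above all share the same INI-style layout, so they can be inspected outside Awasu with a few lines of Python. The sketch below is only an illustration, not a description of how the plugin reads its settings. It assumes that BaseUrl is the prefix used to turn relative links into absolute ones, and that the channel URL is a sensible fallback when BaseUrl is empty. The helper names `load_channel` and `absolute_link` and the relative link `news.cgi?id=1` are hypothetical.

```python
import configparser
from urllib.parse import urljoin

# One of the contributed configs from this page, embedded as text.
CONFIG_TEXT = """\
[ChannelParameters]
URL=http://www.thefutoncritic.com/cgi/home.cgi
BaseURL=http://www.thefutoncritic.com/cgi/
Title=Futon Critic Home
Description=TV PR extracted from Web page
MaxItems=15
Shorthand=
SectionPattern=highlights:.*?><br>(.*?)</div></td>
ItemPattern-1=.*?<a href="(?P<L>.*?)">(?P<T>.*?)>(?P<D>.*?)</a>
ItemPattern-2=
ItemPattern-3=
"""


def load_channel(text):
    """Parse a config and return its [ChannelParameters] section."""
    # interpolation=None keeps any "%" in a pattern literal.
    # delimiters=("=",) stops a ":" inside a key or pattern from being
    # treated as the key/value separator.
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(text)
    return parser["ChannelParameters"]


def absolute_link(params, link):
    """Resolve a scraped link against BaseUrl (assumed behaviour)."""
    # configparser looks keys up case-insensitively, so this finds both
    # the "BaseUrl" and "BaseURL" spellings used on this page.
    base = params.get("BaseUrl") or params["URL"]
    return urljoin(base, link)


params = load_channel(CONFIG_TEXT)
print(params["Title"], params.getint("MaxItems"))
print(absolute_link(params, "news.cgi?id=1"))  # hypothetical relative link
```

Loading a config this way makes it easy to feed its SectionPattern and ItemPattern values into a test script before pointing the Channel Wizard at the plugin.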