artlivemedia
Posts: 4
Joined: Wed Jul 12, 2006 11:39 pm

Postby artlivemedia » Thu Jul 13, 2006 2:04 pm

Hi There,

I'm not a coder. I'm a marketer, but the possibilities of the web crawler plugin were just too good to resist. The problem is that I really think I'm out of my depth on this one. Here is what I want to do and how far I have gotten. I am probably completely off the mark, and hopefully someone will be able to tell me how to get back on track with it all...

There are a number of news websites for Australia and New Zealand industrial/primary industry and high tech market news. They are all run by the same company so I am hoping that if I do one, it will be a lot easier to do the others. The websites are:

http://www.pngindustrynews.net/
http://www.energyreview.net/
http://www.biotechnologynews.net/
http://www.miningnews.net/
http://www.environmentalmanagementnews.net/

So I started with Biotechnologynews.net, downloaded webcrawler, read the instructions, looked at the source of the example .ini files and tried to compare them to the website I was looking at. There were some similarities, but the way this site references its articles seems fairly complex (or at least to my not-so-technical mind!). If you take a look at the properties of the image which accompanies each article, it is sourced from something like http://www.biotechnologynews.net/images/....etc
But the source code refers to a story ID, which must automatically group the text, links and images together. The site is written in ASP, and I don't really understand that language.

So here's what I came up with:

[ChannelParameters]
URL=http://www.biotechnologynews.net
BaseURL=
Title= Biotechnology News AU and NZ
Description=RSS news extracted from Web page
MaxItems=15
Shorthand=
SectionPattern=<span>(?P<T>.*?)</span>
ItemPattern-1=<br><span>(.*?)</span>
ItemPattern-2=<br><br><a href=(?P<D>.*?)</a>
ItemPattern-3=<a href=storyview.asp?storyid=(?P<L>.*?)>Full Story...</a><br>


The Section Pattern is in reference to this code on the site:

RESEARCH</a>&nbsp;</td></tr><tr><td><table><tr><td><table><tr><td><span><a>Stem cell sperm makes baby mice
</a></span><br>


Item Pattern One is in reference to this code:

<span>(Thursday, July 13, 2006)</span>


Item Pattern Two is in reference to this code:

<br><br><a>THE publication of a study by researchers in Germany and the UK has shown that sperm cells created from embryonic stem cells can result in the birth of healthy young mice.
</a>


Item Pattern Three is in reference to this code:

<a>Full Story...</a>


I think the Python code I used (which I had never tried before today) is too simple, but because there was a limited amount of info given I'm just not sure what to do.

The webcrawler plugin was half the reason why I chose to go with Awasu so I really want to make this work and learn from it. I intend to do many sites and I promise if you teach me I will post them for everyone else to use!!

Thanks in advance

Michelle

abwilson
Posts: 247
Joined: Sun Feb 09, 2003 12:36 am
Location: San Francisco, CA -- USA

Postby abwilson » Mon Jul 17, 2006 12:49 am

Sorry -- I didn't realize by "Webcrawler" you were referring to WebScrape, so I initially skipped your posting.

The following is pretty close; however, WebScrape expects some form of quotation mark (" or ') to enclose href/img references, which the biotechnologynews.net page doesn't use, so relative URLs don't work well. You may wish to contact the Webmaster to suggest using quotation marks.

[ChannelParameters]
URL=http://www.biotechnologynews.net
BaseURL=http://www.biotechnologynews.net/
Title=Biotechnology News AU and NZ
Description=RSS news extracted from Web page
MaxItems=15
Shorthand=
SectionPattern=
ItemPattern-1=height=6><br><span><a href=(?P<L>.*?)>
ItemPattern-2=(?P<T>.*?)</a></span><br>
ItemPattern-3=(?P<D>.*?</a><br>)<a
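
To see what these patterns capture, here is a minimal Python sketch that can be run outside Awasu. It assumes the three ItemPattern lines are matched back to back as ordinary Python regular expressions with "." allowed to span line breaks; that is a guess at how WebScrape behaves, not a description of its internals. The sample markup, the story IDs and the second headline are made up for illustration, and extract_items is a hypothetical helper name.

```python
import re
from urllib.parse import urljoin

BASE_URL = "http://www.biotechnologynews.net/"

# The three ItemPattern lines from the config above.
# Assumption: they are applied one after another, so they are simply joined.
ITEM_PATTERNS = [
    r"height=6><br><span><a href=(?P<L>.*?)>",
    r"(?P<T>.*?)</a></span><br>",
    r"(?P<D>.*?</a><br>)<a",
]

# Hypothetical page fragment; story IDs 101 and 102 are invented.
SAMPLE_HTML = (
    "<img height=6><br><span><a href=storyview.asp?storyid=101>"
    "Stem cell sperm makes baby mice\n</a></span><br>"
    "<span>(Thursday, July 13, 2006)</span><br><br>"
    "<a href=storyview.asp?storyid=101>THE publication of a study...\n</a><br>"
    "<a href=storyview.asp?storyid=101>Full Story...</a><br>"
    "<img height=6><br><span><a href=storyview.asp?storyid=102>"
    "Second example headline\n</a></span><br>"
    "<span>(Thursday, July 13, 2006)</span><br><br>"
    "<a href=storyview.asp?storyid=102>Second example summary.\n</a><br>"
    "<a href=storyview.asp?storyid=102>Full Story...</a><br>"
)


def extract_items(html):
    """Return one dict per item, using the L (link), T (title), D (description) groups."""
    item_re = re.compile("".join(ITEM_PATTERNS), re.DOTALL)
    items = []
    for match in item_re.finditer(html):
        items.append({
            # The href is relative, so resolve it against BaseURL.
            "link": urljoin(BASE_URL, match.group("L").strip()),
            "title": match.group("T").strip(),
            "description": match.group("D").strip(),
        })
    return items


if __name__ == "__main__":
    for item in extract_items(SAMPLE_HTML):
        print(item["title"], "->", item["link"])
```

Trying patterns against a saved copy of the page this way makes it easier to see which of the three stops matching when the site layout changes.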

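A side note on ItemPattern-3 from the first post: in a Python regular expression, "." and "?" are special characters, so the literal text storyview.asp?storyid= does not match itself unless those characters are escaped. The sketch below shows the difference on a made-up link (the story ID 101 and the exact markup are assumptions, since the real page source is not shown in full here).

```python
import re

# Hypothetical link markup, for illustration only.
SAMPLE = "<a href=storyview.asp?storyid=101>Full Story...</a><br>"

# As written in the first post: "p?" means "an optional p", not a literal "?".
UNESCAPED = r"<a href=storyview.asp?storyid=(?P<L>.*?)>Full Story...</a><br>"

# Same pattern with the literal parts escaped by re.escape.
ESCAPED = (
    re.escape("<a href=storyview.asp?storyid=")
    + r"(?P<L>.*?)"
    + re.escape(">Full Story...</a><br>")
)

no_match = re.search(UNESCAPED, SAMPLE)
match = re.search(ESCAPED, SAMPLE)

print(no_match)          # prints None
print(match.group("L"))  # prints 101
```

In an .ini pattern the same effect comes from writing storyview\.asp\?storyid= by hand.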

I hope this is of some use...

Allan

artlivemedia
Posts: 4
Joined: Wed Jul 12, 2006 11:39 pm

Postby artlivemedia » Mon Jul 17, 2006 12:53 am

Thanks Allan!

I'll give it a whirl this afternoon and let you know how it goes!

Ta,

Michelle

