aparry
Posts: 7
Joined: Tue Mar 25, 2014 6:48 pm
Location: Greenbelt, MD USA

Postby aparry » Wed Jan 13, 2016 4:18 pm

I am trying to use the DownloadFiles channel plugin to harvest the PDF files from this site:
https://www.gpo.gov/fdsys/html/FR/todays_toc.html

For the record, it was working fine until a few weeks ago. Now we get this error message:
"HTTPError: HTTP Error 500: Internal Server Error" (see the full error log I attached to my email).

I'm assuming it's more an issue of the site's "anti-spidering," since I can't open it through my PHP scripts (even after changing the user agent setting) or through TeleportUltra. However, we can manually navigate to the site and download files one by one with no problems. That is time-consuming, though, so we'd love to keep it automatic through Awasu. Can you help us?

Thanks!
Ann

support
Site Admin
Posts: 3021
Joined: Fri Feb 07, 2003 12:48 pm
Location: Melbourne, Australia

Postby support » Wed Jan 13, 2016 5:11 pm

I note that the links in the web page have an extra colon in them (after the .gov). I'm not sure if this is legal or not, but I ran a quick hacked-up version of the plugin that removes it, and it seems to work. I'll get a new version to you tomorrow...

Postby aparry » Wed Jan 13, 2016 6:20 pm

I noticed that too. I copied the links into a dummy PHP script to try to download the files (deleting the extraneous colon), and I still got blocked by their server. Even changing my user_agent setting from PHP's default to something more benign, as I mentioned above, didn't help: still blocked. But I'm not a pro, by any stretch of the imagination, at getting around sites that block access to scrapers, bots, etc.

Thanks so much for your help!

Postby support » Wed Jan 13, 2016 6:30 pm

They may have blocked your IP address.

I'll also add an option to wait between file downloads.

Postby support » Thu Jan 14, 2016 11:44 am

I've sent you a new version of the plugin.

It has new parameters that let you run a regex over each download URL, then generate a new URL based on what it found. To fix this particular problem, set them as follows:
(*) URL regex = ^(.*)\.gov:(.*)$
(*) URL template = {1}.gov{2}
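
For anyone curious how those two settings interact, here is a rough Python sketch of the idea. It is not the plugin's actual code: the function name `rewrite_url` and the sample URL are made up for illustration, and I'm assuming `{1}`, `{2}` in the template stand for the regex's capture groups.

```python
import re

# The two values from the settings above.
URL_REGEX = r"^(.*)\.gov:(.*)$"
URL_TEMPLATE = "{1}.gov{2}"


def rewrite_url(url, pattern=URL_REGEX, template=URL_TEMPLATE):
    """Hypothetical helper: rebuild a URL from the regex's capture groups.

    If the regex does not match, the URL is returned unchanged.
    """
    match = re.match(pattern, url)
    if match is None:
        return url
    # Assumption: each {n} placeholder is replaced by capture group n.
    return re.sub(
        r"\{(\d+)\}",
        lambda placeholder: match.group(int(placeholder.group(1))),
        template,
    )


if __name__ == "__main__":
    # Made-up link with the stray colon after ".gov".
    print(rewrite_url("https://www.gpo.gov:/fdsys/pkg/example.pdf"))
```

A URL without the stray colon does not match the regex, so it passes through untouched.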

There is also a new Download delay parameter, which tells the plugin to wait between downloads. Even if you haven't been banned, it's polite to set this to something non-zero (e.g. 5 seconds).
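
The delay itself amounts to a pause between requests. A minimal sketch, again not the plugin's real code: `download_all`, `fetch`, and `sleep` are hypothetical names, and `fetch` stands in for whatever actually retrieves a file.

```python
import time


def download_all(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Hypothetical helper: fetch each URL in turn, pausing between requests.

    `fetch` is any callable that takes a URL and returns its content.
    `sleep` is injectable so the pause can be stubbed out when testing.
    """
    results = []
    for index, url in enumerate(urls):
        # No pause before the first download, only between downloads.
        if index > 0 and delay_seconds > 0:
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

Passing the sleep function in as a parameter is just a convenience, so the pause can be checked without actually waiting.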

Postby aparry » Thu Jan 14, 2016 3:53 pm

Works beautifully! Thanks so much!

