ImageSites.pm
    use strict;
    use warnings;
    use ImageSites;

    # initialize the library's configuration variables
    image_library_init(
        'scriptname'             => $0,
        'logfile'                => "images.log",
        'debug'                  => 1,
        'timestamps'             => 1,
        'max_errs'               => 10,
        'pausetime'              => 300,        # five minutes
        'random_wait'            => 1,
        'do_exponential_backoff' => 1,
        'verbose'                => 1,
        'minheight'              => 200,
        'minwidth'               => 200,
        'minarea'                => 40000,
        'minima_or'              => 0,
        'save_dir'               => "$ENV{HOME}/Pictures/goldenage",
    );

    # download an HTML page
    my $html_text = get_page( "http://goldenagepaintings.blogspot.com" );
    die "could not fetch page\n" if not $html_text;

    # get image URLs from the page
    my @images = $html_text =~ m/<img [^>]+ src \s* = \s* ['"] ([^'"]+) ['"] [^>]* >/gix;

    # download the images, pausing after each
    for my $url ( @images ) {
        unless ( $url =~ m/^http/ ) {
            # make a relative URL absolute, avoiding a doubled slash
            $url =~ s!^/!!;
            $url = "http://goldenagepaintings.blogspot.com/" . $url;
        }
        my $img_content = get_page( $url );
        die "could not fetch image $url\n" if not $img_content;
        $url =~ s!.*/!!;        # strip everything but the filename
        save_image( $url, $img_content );
        randpause();
    }
This is a library of functions that used to be shared between crawl-web-for-images.pl and some other scripts I wrote that were tailored for getting images from specific sites (e.g. epilogue.net and artmagick.com). However, those other scripts became obsolete as the sites in question shut down or changed their design over time in ways that make it hard for a crawler to get anything useful from them, and the general-purpose crawl-web-for-images.pl is all that's left.
Takes a hash of options, does some sanity checking on them, creates an LWP::UserAgent object and a utf8 decoder, and initializes the logfile.
Valid option keys:
    debug                   Turn debug mode on or off.

    do_exponential_backoff  Double the sleep time after an unsuccessful GET.

    logfile                 File to write messages to.

    max_errs                Exit after this many nonfatal errors.

    minarea                 Images must be at least this big. Use too_small()
                            to check against these variables; it's the
                            caller's responsibility to figure out the width
                            and height of an image.

    minheight               "

    minwidth                "

    minima_or               The width and height criteria are or'd and not
                            and'd: if either is satisfied, we say the image
                            is big enough.

    pausetime               Baseline sleep time; the actual sleep time may
                            vary if random_wait is set.

    random_wait             If true, vary the sleep time around the baseline
                            randomly.

    quiet                   Print no output to the terminal.

    save_dir                Directory to which to save files.

    scriptname              The name of the script that called us, for
                            debugging purposes.

    timestamps              Whether to print timestamps to the log file.

    verbose                 Be more detailed in our messages.
Set the specified configuration variable to the given value.
Write the content to the specified file, with some error checking. Use this for everything except images; for those, see save_image().
Convert a time string, which may be a number followed by a unit, into a number of seconds. Return -1 if the unit isn't recognized or if the argument otherwise doesn't match the regular expression.
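A minimal sketch of the conversion described above; the function's real name isn't shown in this summary, and the name and unit set here are illustrative assumptions:

    # Hypothetical stand-in for the converter described above.
    sub time_string_to_seconds_sketch {
        my ($str) = @_;
        my %secs_per = ( s => 1, m => 60, h => 3600, d => 86400 );
        return -1 unless $str =~ m/^(\d+)\s*([a-z]?)$/i;
        my ($num, $unit) = ( $1, lc( $2 || 's' ) );
        return -1 unless exists $secs_per{$unit};    # unrecognized unit
        return $num * $secs_per{$unit};
    }

    # e.g. time_string_to_seconds_sketch("5m") returns 300;
    #      time_string_to_seconds_sketch("5 parsecs") returns -1.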
Sleep for a configured amount of time. Typically used to wait a bit after a download before hitting the same server again, or before doing another HTTP GET on any server.
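What that might look like, given the pausetime and random_wait keys from the option list above; the exact variance formula is an assumption, not the module's code:

    # Illustrative only: vary the sleep time around the configured baseline.
    # (%config stands in for the values passed to image_library_init.)
    my %config = ( pausetime => 300, random_wait => 1 );
    my $wait = $config{random_wait}
             ? int( $config{pausetime} / 2 + rand($config{pausetime}) )
             : $config{pausetime};                  # roughly 0.5x to 1.5x base
    sleep $wait;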
Initialize the log file.
Write the specified message to the log file and/or the terminal, depending on configuration variables.
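A sketch of that routing, in terms of the quiet and timestamps option keys above; the %config hash and $log_fh filehandle are assumptions:

    # Illustrative only; not the module's actual internals.
    sub log_message_sketch {
        my ($msg) = @_;
        my $stamp = $config{timestamps} ? scalar(localtime) . " " : "";
        print {$log_fh} $stamp, $msg, "\n" if $log_fh;    # to the log file
        print $msg, "\n" unless $config{quiet};           # to the terminal
    }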
If the argument is 1, log a message and exit. If the argument is 0, keep track of how many nonfatal errors we've had and exit only if we've exceeded max_errs.
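In other words, something along these lines (a sketch: the argument convention comes from the description above, everything else is assumed):

    # Illustrative only.
    my $err_count = 0;
    sub error_sketch {
        my ($is_fatal, $msg) = @_;
        warn "$msg\n";                # stands in for the logging function
        exit 1 if $is_fatal;          # 1: fatal, exit immediately
        exit 1 if ++$err_count > $config{max_errs};   # 0: exit only past max_errs
    }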
Write a log message and exit.
Check an image's size against the minwidth, minheight, minarea, and minima_or configuration variables. Return 1 if too small, 0 otherwise.
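An illustrative implementation of that test; the (width, height) argument order is an assumption, as is how minarea combines with the width/height criteria:

    # Sketch of too_small(); not the module's source.
    sub too_small_sketch {
        my ($w, $h) = @_;
        my $h_ok    = $h >= $config{minheight};
        my $w_ok    = $w >= $config{minwidth};
        my $dims_ok = $config{minima_or} ? ( $h_ok || $w_ok )
                                         : ( $h_ok && $w_ok );
        return ( $dims_ok && $w * $h >= $config{minarea} ) ? 0 : 1;
    }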
Save the image to the specified path and file. If a file of that name already exists and our new image content is larger, overwrite it. If anything goes wrong with saving, return undef; otherwise return 1.
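A sketch of that overwrite-only-if-larger behavior; the raw-mode I/O and the exact size comparison are assumptions:

    # Illustrative only; not the module's source.
    sub save_image_sketch {
        my ($filename, $content) = @_;
        my $path = "$config{save_dir}/$filename";
        # keep the existing file if it is at least as large as the new content
        return 1 if -e $path && -s $path >= length $content;
        open my $fh, '>:raw', $path or return undef;
        print {$fh} $content        or return undef;
        close $fh                   or return undef;
        return 1;
    }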
Check if two URLs are the same after doing some simple transformations to account for the way equivalent URLs can vary.
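The transformations aren't spelled out in this summary; a plausible normalization might look like this (all of it an assumption):

    # Illustrative URL normalization before comparison.
    sub urls_match_sketch {
        my ($u, $v) = @_;
        for ($u, $v) {
            $_ = lc $_;
            s!^https?://!!;     # ignore the scheme
            s!^www\.!!;         # ignore a leading www.
            s!/+$!!;            # ignore trailing slashes
        }
        return $u eq $v;
    }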
Get a page or image from a URL. If the URL ends in an image file extension, but the content-type header indicates it's actually HTML, check the page for links and try to download the actual image (calling ourselves recursively).
If called in scalar context, return the page/image content or undef if something went wrong.
If called in list context, and the page was redirected, return a two-element list of the page content and the redirected URL.
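Both calling conventions, per the description above:

    my $content = get_page($url);              # scalar context: content or undef
    my ($body, $final_url) = get_page($url);   # list context: content plus the
                                               # redirected URL, if redirected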
Return the base directory of a URL (stripping off the filename part).
Return the domain name part of a URL.
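Hypothetical stand-ins showing the expected behavior of these two helpers (their real names aren't shown in this summary):

    # base directory: strip the filename part
    sub url_base_sketch   { my ($u) = @_; $u =~ s![^/]*$!!; return $u }
    # domain name: the host part after the scheme
    sub url_domain_sketch { my ($u) = @_; return $u =~ m!^[a-z]+://([^/:]+)!i ? $1 : undef }

    # url_base_sketch("http://example.com/a/pic.jpg")    =>  "http://example.com/a/"
    # url_domain_sketch("http://example.com/a/pic.jpg")  =>  "example.com"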
strict, warnings, Carp, List::Util, and Encode ship with Perl. LWP::UserAgent is available from CPAN.
Jim Henry III, http://jimhenry.conlang.org/software/
This script is free software; you may redistribute it and/or modify it under the same terms as Perl itself.