Saturday, December 20, 2008

Grab / Retrieve / Crawl Web Pages in Perl - A Quick Tutorial
CS 594/494 Fall 2000
by John Eblen

Getting Started

This tutorial shows you the basics of building Perl programs that can access and crawl the web, using Perl's libwww (LWP) packages. To use these packages, you will need the following lines at the top of your program:

use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor; # lets you extract the links from an HTML page.
In order to retrieve the contents of a page on the web, you can do the following:

$contents = get($URL);
where $URL is the URL of the page. $URL should be a full address, such as http://www.perl.com/. The entire contents of the page are then stored in $contents.

That was pretty easy, wasn't it? Perhaps a little too easy, you might think, and you would be right. While this method works fine in most cases, it does not let you specify parameters for the download, the most important being a timeout value. Any program using this method can therefore get stuck on this one line, waiting forever for a download that will never finish. If you don't mind watching your program as it runs, you can install a ctrl-c handler so that pressing ctrl-c aborts the current download. You can also try a timeout signal, although that does not seem to work reliably on our system here at UT/CS; a minimal sketch of that approach appears after the browser setup below.

A Virtual Browser - a Better Way to Download Pages

As you might have guessed, Perl provides a more sophisticated way to get pages. The following lines initialize a virtual browser that can be used for all your browsing needs and set its timeout to 10 seconds. These packages take advantage of Perl's object-oriented programming features, so the syntax may look strange to those who are new to Perl or who have never studied OO programming in Perl.

$browser = LWP::UserAgent->new();
$browser->timeout(10);
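For completeness, here is a minimal sketch of the timeout-signal approach mentioned above for the plain get() call. The 10-second limit is only an example and, as noted, whether the ALRM signal reliably interrupts a stuck download depends on your system.

my $contents;
eval {
    local $SIG{ALRM} = sub { die "download timed out\n" };  # runs when the alarm fires
    alarm(10);               # give up after roughly 10 seconds (example value)
    $contents = get($URL);   # LWP::Simple::get, as shown earlier
    alarm(0);                # download finished, cancel the alarm
};
if ($@) { print "Could not download $URL: $@"; }

A ctrl-c handler works the same way, except that you install the subroutine in $SIG{INT} and press ctrl-c yourself to abort the download.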
Experimentation shows that the browser usually takes a little longer than the value passed to timeout() before it actually gives up, but at least it does time out. The next three lines will download the page indicated by $URL:

my $request = HTTP::Request->new(GET => $URL);
my $response = $browser->request($request);
if ($response->is_error()) {printf "%s\n", $response->status_line;}
From an OO programming perspective, here's what happens. The first line creates a request object for $URL. The second line performs the actual download for that request and returns a response object. is_error() is a method of the $response object that tells us whether an error occurred; if one did, a description of it is available from the response object's status_line. If you are not familiar with OO programming, you don't need to understand all of the vocabulary in this paragraph in order to use these packages. Finally, the following line retrieves the contents of the downloaded page:

$contents = $response->content();
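Putting these pieces together, a reusable download routine might look like the following sketch. The subroutine name fetch_page is just an illustrative choice, not something provided by the packages; it assumes the use lines from the top of this tutorial.

# Sketch of a download routine built from the calls shown above.
sub fetch_page {
    my ($browser, $URL) = @_;
    my $request  = HTTP::Request->new(GET => $URL);
    my $response = $browser->request($request);
    if ($response->is_error()) {
        printf "%s\n", $response->status_line;
        return undef;                 # signal failure to the caller
    }
    return $response->content();      # the raw page contents
}

my $browser = LWP::UserAgent->new();
$browser->timeout(10);
my $contents = fetch_page($browser, "http://www.perl.com/");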
Extracting Links From a Web Page

To do actual crawling, though, the above is not enough. You need to download web pages, but you also need to be able to follow the links on those pages. That is where the HTML::LinkExtor module comes in. The following code takes $URL and the downloaded page in $contents and stores the page's links in an array, @links:

my ($page_parser) = HTML::LinkExtor->new(undef, $URL);
$page_parser->parse($contents)->eof;
@links = $page_parser->links;
Both $URL and $contents must be correct for this code to work, because the parser uses $URL as the base address when it expands any relative links it finds in the page.
WARNING: $URL should end in "/" for relative links to be expanded correctly. The following code will display all of the links found on the page:

foreach $link (@links) {print "$$link[2]\n";}
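Each element of @links is actually a reference to a small array of the form [tag, attribute => value, ...], which is why the URL comes out of slot 2. LinkExtor also reports links found in tags such as img, so if you only want the href attribute of anchor tags, a sketch like the following would collect them (the @hrefs name is just for illustration):

my @hrefs;
foreach my $link (@links) {
    my ($tag, %attributes) = @$link;                 # e.g. ("a", href => "http://www.perl.com/")
    next unless $tag eq 'a' && $attributes{href};    # keep only <a href="..."> links
    push @hrefs, $attributes{href};
}
print "$_\n" foreach @hrefs;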
Using What You've Learned

With these tools, you should now be able to build a simple but effective crawler in Perl. For any website, you can begin at the site's main page, download it, extract the links, and repeat the process with each of those links, continuing until you retrieve no more links inside that domain. Of course, the number of links to be searched can grow exponentially, but with smart programming you can control the process. Be sure to restrict the crawler to links inside the original domain, or you could potentially head out and start crawling the entire web! A bare-bones sketch of such a crawler appears at the end of this tutorial.

As you may have guessed, this tutorial provides an extremely sketchy, bare-bones description of the above packages. You have seen only what you would probably need for your projects and how to do the most essential operations. You can find much more information inside the module files that implement the packages. These files end in .pm, so look for Response.pm, Request.pm, etc. Note: you can find the .pm files on the UTK/CS system at /mix/usr/local/lib/utk_perl5.
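As promised, here is a bare-bones sketch of such a crawler, built on the hypothetical fetch_page routine sketched earlier. The starting URL and the simple prefix test for staying inside the site are only examples; a real crawler would also want politeness delays and a limit on how many pages it visits.

# Breadth-first crawl of a single site, skipping pages already seen.
my $start = "http://www.cs.utk.edu/";   # example starting point
my @queue = ($start);
my %seen;

while (@queue) {
    my $URL = shift @queue;
    next if $seen{$URL}++;              # never visit the same page twice
    my $contents = fetch_page($browser, $URL);
    next unless defined $contents;

    my $page_parser = HTML::LinkExtor->new(undef, $URL);
    $page_parser->parse($contents)->eof;
    foreach my $link ($page_parser->links) {
        my ($tag, %attributes) = @$link;
        next unless $tag eq 'a' && $attributes{href};
        my $target = "$attributes{href}";    # stringify; it may be a URI object
        # only follow links that stay inside the original site
        push @queue, $target if index($target, $start) == 0;
    }
}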
