A minimal C# multithreaded web crawler
From a user's point of view, all that is needed to start the crawler is a call to the Crawl method on a Crawler instance. Crawler's constructor takes a Cache instance as a parameter, which in turn requires a starting URL and a target directory to be instantiated.
String targetDirectory = @"C:\tenteikura_cache";
Uri startingURL = new Uri("http://www.andreadallera.com");
Cache cache = new Cache(startingURL, targetDirectory);
Crawler crawler = new Crawler(cache);
crawler.Crawl(startingURL); //starts the crawler at http://www.andreadallera.com
Crawler's constructor also takes an optional boolean parameter (default false) which, if true, instructs the crawler to follow links outside the starting URI's domain:
new Crawler(cache, true); //will follow urls outside the starting URI's domain
new Crawler(cache, false); //will fetch only pages inside the starting URI's domain
new Crawler(cache); //same as above
By itself, this will only keep the downloaded pages in the Cache object, which is an IEnumerable of Page objects:
foreach(Page page in cache)
{
    Console.WriteLine(page.Title); //page title
    Console.WriteLine(page.HTML);  //page full HTML
    Console.WriteLine(page.Uri);   //page URI object
    Console.WriteLine(page.Hash);  //a hash of the URI's AbsoluteUri
    //the page exposes an IEnumerable<Uri> containing all the links found on the page itself
    foreach(Uri link in page.Links)
    {
        Console.WriteLine(link.AbsoluteUri);
    }
}
Crawler exposes two events - NewPageFetched and WorkComplete:
//fired when a valid page not already in the cache is downloaded
crawler.NewPageFetched += (page) => {
    //do something with the fetched page
};
//fired when the crawler has no more pages left to fetch
crawler.WorkComplete += () => {
    //shut down the application, notify the GUI, or whatever
};
If you want to persist the fetched pages, a very rudimentary file-system-backed storage option is available via the Persister class:
Persister persister = new Persister(targetDirectory, startingURL);
crawler.NewPageFetched += (page) => {
    persister.save(page);
};
Persister saves each page in a subdirectory of targetDirectory named after startingURL.Authority, as two files: one, named page.Hash + ".link", contains the page's absolute URI; the other, named page.Hash, contains the full page content.
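For example, assuming the on-disk layout described above, the persisted pages could be read back like this (a sketch using System.IO, not part of the library):

//the cache directory is targetDirectory + startingURL.Authority
String cacheDirectory = Path.Combine(targetDirectory, startingURL.Authority);

//each ".link" file holds an absolute URI; the file with the same name
//minus the extension holds the full HTML of that page
foreach (String linkFile in Directory.GetFiles(cacheDirectory, "*.link"))
{
    String absoluteUri = File.ReadAllText(linkFile);
    String html = File.ReadAllText(linkFile.Substring(0, linkFile.Length - ".link".Length));
    Console.WriteLine(absoluteUri + " (" + html.Length + " characters)");
}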
There is an example console application in Tenteikura.Example.
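The wiring in such an application looks roughly like this (a minimal sketch based on the API shown above; the actual example may differ, e.g. in how it waits for completion):

using System;
using System.Threading;

class Program
{
    static void Main()
    {
        String targetDirectory = @"C:\tenteikura_cache";
        Uri startingURL = new Uri("http://www.andreadallera.com");

        Cache cache = new Cache(startingURL, targetDirectory);
        Crawler crawler = new Crawler(cache);
        Persister persister = new Persister(targetDirectory, startingURL);

        ManualResetEvent done = new ManualResetEvent(false);

        //persist every newly fetched page as it arrives
        crawler.NewPageFetched += (page) => {
            persister.save(page);
            Console.WriteLine("Fetched: " + page.Uri.AbsoluteUri);
        };

        //signal completion when there is nothing left to fetch
        crawler.WorkComplete += () => {
            done.Set();
        };

        crawler.Crawl(startingURL);
        //wait for the crawl to finish before exiting (assuming Crawl returns
        //while fetching continues on background threads)
        done.WaitOne();
    }
}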
There's a hard dependency between Cache and Persister at the moment: Cache expects the pages under the targetDirectory + startingUri.Authority path to be in the same format as the ones saved by Persister, while the loading strategy should be injected (and ideally provided by Persister itself).
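One possible direction (purely a sketch, not something the library currently provides) would be a small loading abstraction implemented by Persister and consumed by Cache:

using System.Collections.Generic;

//hypothetical interface: Cache would depend on this instead of on the on-disk format
public interface IPageLoader
{
    //returns every previously persisted page for the starting URI
    IEnumerable<Page> LoadAll();
}

//Persister would implement IPageLoader by reading back the files it wrote,
//and Cache's constructor would accept an IPageLoader instead of scanning
//targetDirectory + startingUri.Authority on its own.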
Persister should use a more effective storage strategy, perhaps backed by an RDBMS or a document store.
The pages are fetched in random order, so there is no traversal prioritization strategy (such as breadth-first or depth-first) of any kind.