A minimal C# multithreaded web crawler
From a user's point of view, all that is needed to start the crawler is a call to the Crawl method on a Crawler instance. Crawler's constructor takes a Cache instance as a parameter, which in turn requires a starting URL and a target directory to be instantiated.
String targetDirectory = @"C:\tenteikura_cache";
Uri startingURL = new Uri("http://www.andreadallera.com");
Cache cache = new Cache(startingURL, targetDirectory);
Crawler crawler = new Crawler(cache);
crawler.Crawl(startingURL); //starts the crawler at http://www.andreadallera.com
Crawler's constructor also takes an optional boolean parameter (default false) which, if true, instructs the crawler to follow links outside the starting URI's domain:
new Crawler(cache, true); //will follow urls outside the starting URI's domain
new Crawler(cache, false); //will fetch only pages inside the starting URI's domain
new Crawler(cache); //same as above
By itself, this will only keep the downloaded pages in the Cache object, which is an IEnumerable of Page objects:
foreach(Page page in cache)
{
    Console.WriteLine(page.Title); //page title
    Console.WriteLine(page.HTML);  //page full HTML
    Console.WriteLine(page.Uri);   //page URI object
    Console.WriteLine(page.Hash);  //a hash of the URI's AbsoluteUri
    //the page exposes an IEnumerable<Uri> containing all the links found on the page itself
    foreach(Uri link in page.Links)
    {
        Console.WriteLine(link.AbsoluteUri);
    }
}
Crawler exposes two events - NewPageFetched and WorkComplete:
//fired when a valid page not already in the cache is downloaded
crawler.NewPageFetched += (page) => {
    //do something with the fetched page
};
//fired when the crawler has no more pages left to fetch
crawler.WorkComplete += () => {
    //shut down the application, notify the GUI, or whatever
};
If you want to persist the fetched pages, a very rudimentary file-system-backed storage option is available via the Persister class:
Persister persister = new Persister(targetDirectory, startingURL);
crawler.NewPageFetched += (page) => {
    persister.save(page);
};
Persister saves each page in a subdirectory of targetDirectory named after startingURL.Authority, as two files: one, named page.Hash + ".link", contains the page's absolute URI; the other, named page.Hash, contains the full page content.
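For example, assuming the on-disk layout described above, the persisted pages could be read back like this (a sketch using System.IO, not part of the library):

//the cache directory is targetDirectory + startingURL.Authority
String cacheDirectory = Path.Combine(targetDirectory, startingURL.Authority);

//each ".link" file holds an absolute URI; the file with the same name
//minus the extension holds the full HTML of that page
foreach (String linkFile in Directory.GetFiles(cacheDirectory, "*.link"))
{
    String absoluteUri = File.ReadAllText(linkFile);
    String html = File.ReadAllText(linkFile.Substring(0, linkFile.Length - ".link".Length));
    Console.WriteLine(absoluteUri + " (" + html.Length + " characters)");
}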
There is an example console application in Tenteikura.Example.
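The wiring in such an application looks roughly like this (a minimal sketch based on the API shown above; the actual example may differ, e.g. in how it waits for completion):

using System;
using System.Threading;

class Program
{
    static void Main()
    {
        String targetDirectory = @"C:\tenteikura_cache";
        Uri startingURL = new Uri("http://www.andreadallera.com");

        Cache cache = new Cache(startingURL, targetDirectory);
        Crawler crawler = new Crawler(cache);
        Persister persister = new Persister(targetDirectory, startingURL);

        ManualResetEvent done = new ManualResetEvent(false);

        //persist every newly fetched page as it arrives
        crawler.NewPageFetched += (page) => {
            persister.save(page);
            Console.WriteLine("Fetched: " + page.Uri.AbsoluteUri);
        };

        //signal completion when there is nothing left to fetch
        crawler.WorkComplete += () => {
            done.Set();
        };

        crawler.Crawl(startingURL);
        //wait for the crawl to finish before exiting (assuming Crawl returns
        //while fetching continues on background threads)
        done.WaitOne();
    }
}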
There's a hard dependency between Cache and Persister at the moment: Cache expects the pages under the targetDirectory + startingUri.Authority path to be in the same format as the ones saved by Persister, while the loading strategy should be injected (and ideally provided by Persister itself).
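One possible direction (purely a sketch, not something the library currently provides) would be a small loading abstraction implemented by Persister and consumed by Cache:

using System.Collections.Generic;

//hypothetical interface: Cache would depend on this instead of on the on-disk format
public interface IPageLoader
{
    //returns every previously persisted page for the starting URI
    IEnumerable<Page> LoadAll();
}

//Persister would implement IPageLoader by reading back the files it wrote,
//and Cache's constructor would accept an IPageLoader instead of scanning
//targetDirectory + startingUri.Authority on its own.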
Persister should use a more effective storage strategy, perhaps backed by an RDBMS or a document store.
The pages are fetched in random order, so there is no traversal prioritization strategy (such as breadth-first or depth-first) of any kind.