
blues73/Webmuncher

 
 


webmuncher is a tool for retrieving all the contents of a website or, more accurately, all the contents under a single domain. That is the use case it was originally created for; read more about that [here](http://geekabyte.blogspot.be/2014/12/a-web-scrapercrawler-in-java-krwkrw.html).

webmuncher is available via Maven Central, and you can drop it into your project with these coordinates:

Maven:

    <dependency>
        <groupId>com.blogspot.geekabyte.webmuncher</groupId>
        <artifactId>webmuncher</artifactId>
        <version>${webmuncher.version}</version>
    </dependency>

Gradle:

    dependencies {
        compile "com.blogspot.geekabyte.webmuncher:webmuncher:${webmuncher.version}"
    }

Alternatively, build from source and put the resulting jar on your classpath.

The available releases are listed [here](https://github.com/dadepo/webmuncher/releases).

The announcement for the most recent release can be seen here

### How to use webmuncher

webmuncher is designed around the [Strategy Pattern](http://en.wikipedia.org/wiki/Strategy_pattern). The main object you work with is the webmuncher object. The client supplies an implementation of the FetchAction interface, which contains the code that operates on every fetched page; each page is represented by a FetchedPage object.

The FetchAction interface has only one method to implement: execute(). It receives a FetchedPage object containing the information extracted from each crawled page, for example the HTML content of the page, its URI, its title, and the time it took webmuncher to retrieve it.
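
To make the Strategy Pattern concrete, here is a minimal sketch of a custom action that only collects page titles in memory. The `FetchedPage` and `FetchAction` interfaces below are simplified stand-ins declared so the snippet compiles on its own; their `String` return types are assumptions, and `TitleCollectingAction` is a hypothetical name. In a real project you would drop the stand-ins and import the types that ship with the library.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the library's FetchedPage (assumed shape, only two getters).
interface FetchedPage {
    String getUrl();
    String getTitle();
}

// Stand-in for the library's FetchAction: one method, called for each page.
interface FetchAction {
    void execute(FetchedPage page);
}

// Hypothetical strategy that records "title <url>" for every fetched page.
class TitleCollectingAction implements FetchAction {
    private final List<String> lines = new ArrayList<>();

    @Override
    public void execute(FetchedPage page) {
        lines.add(page.getTitle() + " <" + page.getUrl() + ">");
    }

    // Returns everything collected so far, in crawl order.
    List<String> getLines() {
        return lines;
    }
}
```

Because the crawler only ever calls execute(), swapping this action for one that writes to a file or a database should not require any change to the crawling code.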

Since version 0.1.2, webmuncher comes with utility FetchActions that make it easy to persist crawled pages. The included utility actions are:

  1. JDBCAction - for persisting web pages into a relational database. (since 0.1.2)
  2. ElasticSearchAction - for indexing web pages into ElasticSearch. (since 0.1.2)
  3. CSVAction - for saving web pages into a CSV file. (since 0.1.2)

For example, to use webmuncher to extract all the contents of http://www.example.com into a CSV file, you do:

    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.Set;

    // Use the builder to build the CSVAction
    CSVAction action = CSVAction.builder()
            .convertToPlainText(true) // converts HTML to plain text
            .setDestination(Paths.get("example-com.csv"))
            .buildAction();

    // Create an instance of the crawler with the action
    webmuncher crawler = new webmuncher(action);

    // Configure the crawler to your heart's desire

    // The crawler will wait 20 seconds between requests
    crawler.setDelay(20);

    // If at first you don't succeed?
    // Give up and move on to the next one after 3 attempts!
    crawler.setMaxRetry(3);

    // The crawler selects randomly from the list of user agents
    // you give it for each request
    crawler.setUserAgents(Arrays.asList(
            "Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6)...",
            "Opera/9.80 (X11; Linux i686; Ubuntu/14.10)..."));

    // Provide the list of addresses to use as the referrer, so that when the
    // folks at example.com check the web server logs, the request sometimes
    // comes from Google, sometimes from Yahoo, sometimes from Bing...
    crawler.setReferrals(Arrays.asList(
            "http://www.google.com",
            "http://www.yahoo.com",
            "http://www.bing.com"));

    // Start the crawling operation as a blocking call.
    Set<String> strings = crawler.crawl("http://www.example.com");

    // If you want to execute the crawling in another thread,
    // so the current thread does not block, then do this instead:
    Set<String> asyncStrings = crawler.crawlAsync("http://www.example.com");

    // If you do the crawling in another thread,
    // you most likely want to be notified when the
    // crawling operation terminates. In that case,
    // use crawler.onExit(FetchExitCallback callback)
    // to register the callback.
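
The difference between a blocking crawl and a background crawl with an exit callback can be illustrated with plain JDK classes. The sketch below does not use webmuncher's API: `AsyncCrawlSketch`, `blockingCrawl`, and the `Consumer`-based callback are hypothetical stand-ins, and the actual signatures of crawlAsync and FetchExitCallback should be checked in the Javadoc.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

class AsyncCrawlSketch {

    // Hypothetical stand-in for a blocking crawl; it only pretends to visit two pages.
    static Set<String> blockingCrawl(String seedUrl) {
        Set<String> visited = new TreeSet<>();
        visited.add(seedUrl);
        visited.add(seedUrl + "/about");
        return visited;
    }

    // Runs the crawl on another thread and hands the result to onExit when it finishes.
    // The caller gets a future back immediately and is free to do other work.
    static CompletableFuture<Void> crawlAsync(String seedUrl, Consumer<Set<String>> onExit) {
        return CompletableFuture.supplyAsync(() -> blockingCrawl(seedUrl))
                .thenAccept(onExit);
    }
}
```

A caller could write `AsyncCrawlSketch.crawlAsync("http://www.example.com", urls -> System.out.println(urls.size()))` and continue immediately; the callback runs once the background work completes.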

The steps above make use of the CSVAction that comes with the library. If you have custom operations you want applied to the fetched web pages, you can implement your own FetchAction. For example, a JPA-backed FetchAction implementation may look like:

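    import javax.persistence.EntityManager;
    import javax.persistence.EntityManagerFactory;
    import javax.persistence.Persistence;

    class CustomJpaAction implements FetchAction {

        // Creating the factory is expensive, so build it once and reuse it
        // for every page instead of once per execute() call.
        private final EntityManagerFactory emf =
                Persistence.createEntityManagerFactory("FetchedPage");

        /**
         * Persists the given {@link com.blogspot.geekabyte.webmuncher.FetchedPage}.
         *
         * @param page the fetched page to store
         */
        @Override
        public void execute(FetchedPage page) {
            EntityManager em = emf.createEntityManager();
            try {
                em.getTransaction().begin();

                FetchedPageEntity entity = new FetchedPageEntity();
                entity.setHtml(page.getHtml());
                entity.setLoadTime(page.getLoadTime());
                entity.setStatus(page.getStatus());
                entity.setTitle(page.getTitle());
                entity.setUrl(page.getUrl());
                entity.setSourceUrl(page.getSourceUrl());

                em.persist(entity);
                em.getTransaction().commit(); // commit also flushes pending changes
            } catch (RuntimeException e) {
                // Undo the partial write before propagating the failure
                if (em.getTransaction().isActive()) {
                    em.getTransaction().rollback();
                }
                throw e;
            } finally {
                em.close();
            }
        }
    }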

### Overview of the webmuncher API

The accompanying Javadoc gives an overview of the API. It can be generated using the [Javadoc tool](http://www.oracle.com/technetwork/articles/java/index-jsp-135444.html) or via Maven using the [Maven Javadoc plugin](http://maven.apache.org/plugins/maven-javadoc-plugin/).

More conveniently, thanks to Javadoc.io, you can also access the most recent Javadoc online

Licenses

[The MIT License (MIT)](http://www.opensource.org/licenses/mit-license.php)
