This is a web crawler initially started as part of my university course project.
Usage:

- Create a `Crawler` object with a seed URL.
- Start crawling by invoking the `c.crawl()` method. `c.crawl()` recursively discovers pages and downloads them to the `/crawl` directory.
- View the resulting crawl using the `c.printCrawlMap()` method.
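A minimal end-to-end sketch of the steps above. The `crawl()` and `printCrawlMap()` calls come from this README; the constructor signature (a single seed-URL string) is an assumption:

```java
// Minimal usage sketch; the Crawler constructor signature is assumed.
public class CrawlerExample {
    public static void main(String[] args) {
        Crawler c = new Crawler("https://example.com"); // seed URL (assumed constructor)
        c.crawl();          // recursively discovers pages and downloads them to /crawl
        c.printCrawlMap();  // prints the resulting crawl map
    }
}
```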
Features:

- Easy to use, with modular components
- Can customize the storage directory for crawled webpages
- Each `CrawledURL` stores its parent and children URLs (bidirectional links); a rough sketch follows this list
- Each `CrawledURL` stores associated metadata (e.g. number of outlinks)
- Multiple `Crawler` objects can be used together to crawl different websites
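As a rough illustration of the bidirectional links and metadata described above, a `CrawledURL` might be shaped like this; the field and method names are assumptions, not the project's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the CrawledURL shape implied by the features above.
public class CrawledURL {
    private final String url;
    private final CrawledURL parent;                             // page this URL was discovered from
    private final List<CrawledURL> children = new ArrayList<>(); // pages discovered from this URL
    private int outlinkCount;                                    // example metadata: number of outlinks

    public CrawledURL(String url, CrawledURL parent) {
        this.url = url;
        this.parent = parent;
    }

    // Keeps the links bidirectional: the child records its parent at
    // construction time, and the parent records the child here.
    public CrawledURL addChild(String childUrl) {
        CrawledURL child = new CrawledURL(childUrl, this);
        children.add(child);
        outlinkCount++;
        return child;
    }

    public String getUrl() {
        return url;
    }
}
```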
TODO:

- Add tests wherever possible!
- Browse crawled data interactively, with the ability to view the metadata for a specific crawled URL
- Use jsoup (an HTML parser) for fetching URLs; check whether it makes things easier (a fetch sketch follows this list)
- Store crawl metadata and results in a DB (MySQL or MongoDB)
- Ability to work with multiple `Crawler`s
  - Aggregate results from multiple `Crawler`s
  - Filter out results common to multiple `Crawler`s
- Support (and possibly store) other content/file types; only HTML is supported now
- Mechanisms to prevent crashes (e.g. limit file size)
- Multithreaded crawling/download of child URLs (a thread-pool sketch follows below)
- Test the resulting performance improvements
- See if max depth can be increased to 5
- Make this into a CLI app (using Picocli)
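If jsoup is adopted, fetching a page and extracting its outlinks could look roughly like this. This is an exploratory sketch for the TODO item above, not code that exists in the project; `fetchOutlinks` is a hypothetical helper:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Exploratory sketch: fetch a page with jsoup and collect its outlinks.
public class JsoupFetchSketch {
    public static List<String> fetchOutlinks(String url) throws IOException {
        Document doc = Jsoup.connect(url).get();  // fetch and parse the page
        List<String> outlinks = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            outlinks.add(link.absUrl("href"));    // resolve relative URLs against the page
        }
        return outlinks;
    }
}
```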
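For the multithreading item, one possible approach is a fixed-size thread pool that downloads a page's child URLs in parallel; `downloadPage` is a hypothetical stand-in for the existing single-threaded download logic:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Possible approach: download child URLs in parallel with a thread pool.
public class ParallelDownloadSketch {
    public static void downloadChildren(List<String> childUrls) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8); // pool size is arbitrary here
        for (String url : childUrls) {
            pool.submit(() -> downloadPage(url)); // each child downloads on a pool thread
        }
        pool.shutdown();                          // stop accepting new tasks
        pool.awaitTermination(5, TimeUnit.MINUTES);
    }

    private static void downloadPage(String url) {
        // Hypothetical placeholder for the existing download logic.
    }
}
```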