- Introduction
- Software components
- Design challenge
- Technical stack
- How to run unit tests
- How to run lint formatting checks
- Configurations
- Pre-requisites to build and run
- How to build and run
- CI/CD usage
- Logging
- Error handling
- For next steps
This is a scraper that can be used to submit a valid URL and scrape the following information from the website content.
- HTML version
- Page title
- Number of headings and levels
- Internal links
- External links
- Number of inaccessible links on the page
- If the page contains a login form
The scraper application consists of two main software components: the Scraper API and the Scraper client.
[Figure 01 - Scraper high level design]
- Scraper API - This is the backend API service for the scraper client web application. It executes the scraping and responds to the client web application with the information found.
- Scraper client - This is the client-facing web application, which can be accessed through the browser. It connects to the Scraper API to execute scraping.
The Scraper API is modularized as follows:
- Handlers
  - Scrape handler - Handles the initial scrape request.
  - Page handler - Handles subsequent pagination requests.
- Services
  - HTML parser - Fetches the HTML content of the given URL and processes it.
  - URL status checker - Checks the statuses of URLs found in the HTML content.
- Models
  - Entity - Models for data transfer and storage within the application.
  - Response - Models to construct API response payloads.
- Storage
  - In-memory storage - Holds fetched information mapped to a random unique key (a sketch follows the diagram below).
- Utils
  - Helpers - Helper functions.
[Figure 02 - Scraper API high level architecture diagram]
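The in-memory storage mentioned above can be pictured as a map guarded by a mutex, keyed by a random unique request ID. The snippet below is only a minimal sketch under that assumption; the type and function names are hypothetical and the actual implementation may differ.

```go
package storage

import (
	"crypto/rand"
	"encoding/hex"
	"sync"
)

// ScrapedPage is a hypothetical entity holding the information
// extracted from a single page.
type ScrapedPage struct {
	HTMLVersion  string
	Title        string
	InternalURLs []string
	ExternalURLs []string
}

// MemoryStore maps a random unique key (the request ID) to scraped data.
type MemoryStore struct {
	mu    sync.RWMutex
	pages map[string]*ScrapedPage
}

// NewMemoryStore creates an empty store.
func NewMemoryStore() *MemoryStore {
	return &MemoryStore{pages: make(map[string]*ScrapedPage)}
}

// Save stores the page under a freshly generated random key and returns that key.
func (s *MemoryStore) Save(page *ScrapedPage) string {
	buf := make([]byte, 8)
	_, _ = rand.Read(buf) // error ignored for brevity in this sketch
	key := hex.EncodeToString(buf)

	s.mu.Lock()
	defer s.mu.Unlock()
	s.pages[key] = page
	return key
}

// Get returns the stored page for a key, if present.
func (s *MemoryStore) Get(key string) (*ScrapedPage, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	page, ok := s.pages[key]
	return page, ok
}
```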
Requirement
After fetching the HTML content, we should find all external and internal links in it and count the number of inaccessible links, if there are any.
Challenge
An HTML page can contain a large number of hyperlinks, and each hyperlink points to a host with some latency. This latency can range from 100 milliseconds to 10 seconds or more, depending on many factors.
If a user enters a website with hundreds of links, and our service tries to check the accessibility of every link before responding to the initial query, the user might need to wait a minute or two, or even longer, to see the response. If a user submits the URL of a page with thousands of hyperlinks, the system will struggle to process them all and the user may leave due to the long wait. Moreover, because of the concurrent nature of the implementation and the statelessness of the API, the system would still access all of those hyperlinks even after the user has closed the browser.
Therefore, this can create a bottleneck in the system and can easily lead to a system failure.
Solution
We designed the scraper-api to process only 10 hyperlinks concurrently at a time. After receiving the initial scraping request, the scraper-api fetches the HTML content, collects all the required information, and stores it in the in-memory storage. It then processes the first 10 hyperlinks, marks their accessibility status, and replies with the response. To process the rest of the hyperlinks, we provide a button that fetches the next 10 hyperlinks on each click, and we accumulate the inaccessible hyperlink count and show it on the UI (a simplified sketch of this batching follows below).
With this approach, the initial response for scraping a website with around 500 hyperlinks took only 2-3 seconds, and each subsequent hyperlink-processing request took only 1-2 seconds.
With this approach, the application can support websites with a considerably high number of URLs.
- In-memory storage - Since we use in-memory storage for the extracted information, this can have a negative impact when many requests to scrape huge websites are received. It can also lead to data loss.
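As a rough illustration of the batching described above, the following sketch checks one page of 10 URLs concurrently and counts the inaccessible ones. It is not the actual scraper-api code; the function name, the use of HEAD requests, and the error handling are assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// checkURLBatch checks one "page" of URLs (10 per request in this design)
// concurrently and returns how many of them were inaccessible.
func checkURLBatch(urls []string, timeout time.Duration) int {
	client := &http.Client{Timeout: timeout}

	var (
		wg           sync.WaitGroup
		mu           sync.Mutex
		inaccessible int
	)

	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			resp, err := client.Head(u) // a HEAD request is an assumption here
			if err != nil {
				mu.Lock()
				inaccessible++
				mu.Unlock()
				return
			}
			resp.Body.Close()
			if resp.StatusCode >= 400 {
				mu.Lock()
				inaccessible++
				mu.Unlock()
			}
		}(u)
	}
	wg.Wait()
	return inaccessible
}

func main() {
	batch := []string{"https://www.google.com", "https://example.com"}
	// 10 seconds matches OUT_GOING_URL_ACCESSIBILITY_CHECK_TIMEOUT in the .env example.
	fmt.Println("inaccessible:", checkURLBatch(batch, 10*time.Second))
}
```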
- Go language
- Gin framework
- Monkey mock
- Httpmock
- Testify
- Cors
- Docker
- Docker compose
- Github actions
- Go vet
- ReactJs
- TypeScript
- Material UI
- Docker
- Docker compose
- Github actions
Due to time constraints, unit tests are available only in the Scraper API (Go lang) project.
1. Scraper API - GitHub repo
- Go lang
- Clone the GitHub repo using the `git clone git@github.com:go-scraper/scraper-api.git` command.
- Open the command line and navigate to the `scraper-api` dir.
- Run the `go mod tidy` command to install all dependencies listed in the `go.mod` file.
- To run unit tests with the coverage output, run the `go test -coverprofile=coverage.out ./...` command.
- To view the test coverage, run the `go tool cover -func=coverage.out` command.
[Figure 03 - Test coverage output]
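For reference, a unit test in this stack would typically combine testify assertions with httpmock to stub outgoing HTTP calls, as in the minimal sketch below. The package, test, and URL names are illustrative only and do not correspond to actual files in the repo.

```go
package services_test

import (
	"net/http"
	"testing"

	"github.com/jarcoal/httpmock"
	"github.com/stretchr/testify/assert"
)

// TestURLStatusCheck stubs an outgoing request and asserts on the returned status.
func TestURLStatusCheck(t *testing.T) {
	httpmock.Activate()
	defer httpmock.DeactivateAndReset()

	// Any GET to this URL is answered locally with a 200 response.
	httpmock.RegisterResponder("GET", "https://example.com",
		httpmock.NewStringResponder(200, "<html></html>"))

	resp, err := http.Get("https://example.com")
	assert.NoError(t, err)
	defer resp.Body.Close()
	assert.Equal(t, 200, resp.StatusCode)
}
```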
Due to time constraints, lint formatting checks are available only in the Scraper API (Go lang) project.
1. Scraper API - GitHub repo
- Go lang
- Set the `GOPATH` env variable with the `<go-installed-dir>/golang/go` value.
- Run the `./golangci-lint run` command.
- Maximum allowed line length is 100 chars.
You can change configurations in the `.env` file of each project, scraper-api and scraper-client.
1. Scraper API - GitHub repo/.env
# Application port
APP_PORT=8080
# Page size for the URL status check
URL_STATUS_CHECK_PAGE_SIZE=10
# Outgoing scrape request timeout
OUT_GOING_SCRAPE_REQ_TIMEOUT=30 # in seconds
# Outgoing URL accessibility check timeout
OUT_GOING_URL_ACCESSIBILITY_CHECK_TIMEOUT=10 # in seconds
2. Scraper Client - GitHub repo/.env
# Application port
PORT=3000
# Scraper API base URL
REACT_APP_SCRAPER_API_BASE_URL=http://localhost:8080
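On the API side, these values are plain environment variables, so they can be read with the Go standard library. The snippet below is only a sketch of how that could look; the `Load` helper and its defaults are hypothetical, not the repo's actual config code.

```go
package config

import (
	"os"
	"strconv"
	"time"
)

// getInt reads an integer env variable, falling back to def when unset or invalid.
func getInt(key string, def int) int {
	if v, err := strconv.Atoi(os.Getenv(key)); err == nil {
		return v
	}
	return def
}

// Config mirrors the .env values shown above.
type Config struct {
	AppPort           string
	URLStatusPageSize int
	ScrapeTimeout     time.Duration
	URLCheckTimeout   time.Duration
}

// Load builds a Config from the environment, with sensible defaults.
func Load() Config {
	port := os.Getenv("APP_PORT")
	if port == "" {
		port = "8080"
	}
	return Config{
		AppPort:           port,
		URLStatusPageSize: getInt("URL_STATUS_CHECK_PAGE_SIZE", 10),
		ScrapeTimeout:     time.Duration(getInt("OUT_GOING_SCRAPE_REQ_TIMEOUT", 30)) * time.Second,
		URLCheckTimeout:   time.Duration(getInt("OUT_GOING_URL_ACCESSIBILITY_CHECK_TIMEOUT", 10)) * time.Second,
	}
}
```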
- Docker - Since both software components are Dockerized, having the Docker service up and running on your machine is enough to build and run the scraper.
To use the scraper, we have to run both the scraper client and the API.
1. Scraper API - GitHub repo
- Clone the GitHub repo using the `git clone git@github.com:go-scraper/scraper-api.git` command if you haven't already.
- Open the command line and navigate to the root folder (`scraper-api`) of the project.
- To build and run with Docker, run the `docker-compose up --build` command.
- Try sending a `GET` request using Postman or any client to the URL `http://localhost:8080/scrape?url=https://google.com`. You can pass any valid URL to the `url` query parameter.
- If the application is up and running without any errors, you should receive a response in the below format with the HTTP status code `200`.
```json
{
  "request_id": "20250104020245-hYVHLDhl",
  "pagination": {
    "page_size": 10,
    "current_page": 1,
    "total_pages": 2,
    "next_page": "/scrape/20250104020245-hYVHLDhl/2"
  },
  "scraped": {
    "html_version": "HTML 5",
    "title": "Google",
    "headings": {},
    "contains_login_form": false,
    "total_urls": 19,
    "internal_urls": 15,
    "external_urls": 4,
    "paginated": {
      "inaccessible_urls": 0,
      "urls": [
        {
          "url": "https://www.google.com/imghp?hl=en&tab=wi",
          "http_status": 200,
          "error": ""
        }
      ]
    }
  }
}
```
In case of an error, you should receive an error response in the following format with the corresponding HTTP status code.
```json
{
  "error": "Error message"
}
```
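If you prefer code over Postman, the documented response shapes above can be consumed with a few structs, as in the hedged sketch below. The struct fields are trimmed to the keys shown in this README; the real entity models in the repo may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// scrapeResponse mirrors the success payload documented above (trimmed).
type scrapeResponse struct {
	RequestID  string `json:"request_id"`
	Pagination struct {
		PageSize    int    `json:"page_size"`
		CurrentPage int    `json:"current_page"`
		TotalPages  int    `json:"total_pages"`
		NextPage    string `json:"next_page"`
	} `json:"pagination"`
	Scraped struct {
		HTMLVersion       string `json:"html_version"`
		Title             string `json:"title"`
		ContainsLoginForm bool   `json:"contains_login_form"`
		TotalURLs         int    `json:"total_urls"`
		InternalURLs      int    `json:"internal_urls"`
		ExternalURLs      int    `json:"external_urls"`
	} `json:"scraped"`
}

// errorResponse mirrors the error payload documented above.
type errorResponse struct {
	Error string `json:"error"`
}

func main() {
	target := "https://google.com"
	resp, err := http.Get("http://localhost:8080/scrape?url=" + url.QueryEscape(target))
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		var e errorResponse
		_ = json.NewDecoder(resp.Body).Decode(&e)
		fmt.Println("error response:", e.Error)
		return
	}

	var s scrapeResponse
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%s: %d total URLs, next page %s\n", s.Scraped.Title, s.Scraped.TotalURLs, s.Pagination.NextPage)
}
```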
- Next, we should build and run the Scraper client application.
2. Scraper client - GitHub repo
- Clone the GitHub repo using the `git clone git@github.com:go-scraper/scraper-client.git` command if you haven't already.
- Open the command line and navigate to the root folder (`scraper-client`) of the project.
- To build and run with Docker, run the `docker-compose up --build` command.
- Access the web application in your browser using the `http://localhost:3000/` URL.
[Figure 04 - Client application screen]
- Now you can enter a URL in the input box and click the `SCRAPE` button or hit `Enter`. You should see the scraping results as in the below screenshot.
[Figure 05 - Client application screen with scraping results]
- When more than 10 hyperlinks are detected, their accessibility is checked on a page-by-page basis. Click the `ACCESS NEXT 10 OF REMAINING...` button to proceed with the next 10 hyperlinks. If inaccessible hyperlinks are detected, their count is accumulated into the URL insights data.
Added GitHub workflow YML files to both the scraper API and client repos to verify:
- Dependency download
- Unit tests
- Source build
- Docker build
- Lint formatting
Due to time constraints, unit tests and lint formatting checks are available only in the Scraper API (Go lang) project.
Defined custom loggers for the scraper-api, and the following log levels are available:
- DEBUG - The printed log line will start with `[scraper-DEBUG]`.
- INFO - The printed log line will start with `[scraper-INFO]`.
- ERROR - The printed log line will start with `[scraper-ERROR]`.
Currently we output logs to the standard output.
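One simple way to produce such prefixed log lines on standard output is with the standard library's log package, as sketched below. The actual logger implementation in the repo may be structured differently.

```go
package logger

import (
	"log"
	"os"
)

// Prefixed loggers matching the log level prefixes described above,
// all writing to standard output.
var (
	Debug = log.New(os.Stdout, "[scraper-DEBUG] ", log.LstdFlags)
	Info  = log.New(os.Stdout, "[scraper-INFO] ", log.LstdFlags)
	Error = log.New(os.Stdout, "[scraper-ERROR] ", log.LstdFlags)
)
```

For example, `logger.Info.Printf("scraping %s", url)` would print a line starting with `[scraper-INFO]`.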
We designed the backend to return meaningful errors in the error response when there is a failure. Therefore, the frontend client application can show the exact error received through the API.
- Invalid URL format
[Figure 06 - Invalid URL format error]
- Failed to reach the URL error
[Figure 07 - Failed to reach the URL error]
- Request timeout error
[Figure 08 - Request timeout error]
- Network error - This is for the connection between the scraper API and the client, i.e., if the scraper API is not reachable.
[Figure 09 - Network error]
When we check the accessibility of hyperlinks, we return the original error without manipulating it, to allow users to see the original error and make their own decisions.
[Figure 10 - An error during the accessibility check]
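Since the API is built with Gin, returning the documented `{"error": "..."}` shape is a one-liner in a handler. The sketch below is illustrative only; the validation logic and handler signature are assumptions, not the repo's actual code.

```go
package handlers

import (
	"net/http"
	"net/url"

	"github.com/gin-gonic/gin"
)

// Scrape validates the incoming URL and replies with the documented error shape on failure.
func Scrape(c *gin.Context) {
	raw := c.Query("url")
	if _, err := url.ParseRequestURI(raw); err != nil {
		// Mirrors the "Invalid URL format" error shown in Figure 06.
		c.JSON(http.StatusBadRequest, gin.H{"error": "Invalid URL format"})
		return
	}
	// On success, the real handler continues with scraping and responds with
	// the payload format shown in the "How to build and run" section.
}
```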
The below improvements/extensions were identified as potential next steps.
- Replace the in-memory storage with a database.
- Use a messaging technique to pass data changes in real-time to the UI.
- Provide a personalized dashboard to see/manage the scraping history.
- Allow users to create scheduled scraping jobs.
- Allow users to setup custom data processors and configure alerts.
- Introduce a pricing model based on supported features/provided resource limits.