- Introduction
- Software components
- Design challenge
- Technical stack
- How to run unit tests
- How to run lint formatting checks
- Configurations
- Pre-requisites to build and run
- How to build and run
- CI/CD usage
- Logging
- Error handling
- For next steps
This is a scraper that can be used to submit a valid URL and scrape the following information from the website content.
- HTML version
- Page title
- Number of headings and levels
- Internal links
- External links
- Number of inaccessible links on the page
- If the page contains a login form
The scraper application consists of two main software components: the Scraper API and the Scraper client.
[Figure 01 - Scraper high level design]
- Scraper API - This is the backend API service for the scraper client web application. It executes the scraping and responds to the client web application with the information found.
- Scraper client - This is the client-facing web application, which can be accessed through the browser. It connects to the Scraper API to execute scraping.
The Scraper API is modularized as follows:
- Handlers
  - Scrape handler - Handles the initial scrape request.
  - Page handler - Handles subsequent pagination requests.
- Services
  - HTML parser - Fetches the HTML content of the given URL and processes it.
  - URL status checker - Checks the statuses of URLs found in the HTML content.
- Models
  - Entity - Models for data transfer and storage within the application.
  - Response - Models to construct API response payloads.
- Storage
  - In-memory storage - Holds fetched information mapped to a random unique key (a sketch follows the diagram below).
- Utils
  - Helpers - Helper functions.
[Figure 02 - Scraper API high level architecture diagram]
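The in-memory storage mentioned above can be pictured as a map guarded by a mutex, keyed by a random unique request ID. The snippet below is only a minimal sketch under that assumption; the type and function names are hypothetical and the actual implementation may differ.

```go
package storage

import (
	"crypto/rand"
	"encoding/hex"
	"sync"
)

// ScrapedPage is a hypothetical entity holding the information
// extracted from a single page.
type ScrapedPage struct {
	HTMLVersion  string
	Title        string
	InternalURLs []string
	ExternalURLs []string
}

// MemoryStore maps a random unique key (the request ID) to scraped data.
type MemoryStore struct {
	mu    sync.RWMutex
	pages map[string]*ScrapedPage
}

// NewMemoryStore creates an empty store.
func NewMemoryStore() *MemoryStore {
	return &MemoryStore{pages: make(map[string]*ScrapedPage)}
}

// Save stores the page under a freshly generated random key and returns that key.
func (s *MemoryStore) Save(page *ScrapedPage) string {
	buf := make([]byte, 8)
	_, _ = rand.Read(buf) // error ignored for brevity in this sketch
	key := hex.EncodeToString(buf)

	s.mu.Lock()
	defer s.mu.Unlock()
	s.pages[key] = page
	return key
}

// Get returns the stored page for a key, if present.
func (s *MemoryStore) Get(key string) (*ScrapedPage, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	page, ok := s.pages[key]
	return page, ok
}
```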
Requirement
After fetching the HTML content, we should find all external and internal links in it and count the number of inaccessible links, if there are any.
Challenge
An HTML page can contain a large number of hyperlinks, and each hyperlink points to a host with some latency. This latency can range from 100 milliseconds to 10 seconds or more, depending on many factors.
If a user enters a website with hundreds of links, and our service tries to check the accessibility of every link before responding to the initial query, the user might need to wait a minute or two, or even longer, to see the response. If a user submits the URL of a page with thousands of hyperlinks, the system will struggle to process them all and the user may leave due to the long wait. Moreover, because of the concurrent nature of the implementation and the statelessness of the API, the system would still access all of those hyperlinks even after the user has closed the browser.
Therefore, this can create a bottleneck in the system and can easily lead to a system failure.
Solution
We designed the scraper-api to process only 10 hyperlinks concurrently at a time. After receiving the initial scraping request, the scraper-api fetches the HTML content, collects all the required information, and stores it in the in-memory storage. It then processes the first 10 hyperlinks, marks their accessibility status, and replies with the response. To process the rest of the hyperlinks, we provide a button that fetches the next 10 hyperlinks on each click, and we accumulate the inaccessible hyperlink count and show it on the UI (a simplified sketch of this batching follows below).
With this approach, the initial response for scraping a website with around 500 hyperlinks took only 2-3 seconds, and each subsequent hyperlink-processing request took only 1-2 seconds.
With this approach, the application can support websites with a considerably high number of URLs.
- In-memory storage - Since we use in-memory storage for the extracted information, this can have a negative impact when many requests to scrape huge websites are received. It can also lead to data loss.
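As a rough illustration of the batching described above, the following sketch checks one page of 10 URLs concurrently and counts the inaccessible ones. It is not the actual scraper-api code; the function name, the use of HEAD requests, and the error handling are assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// checkURLBatch checks one "page" of URLs (10 per request in this design)
// concurrently and returns how many of them were inaccessible.
func checkURLBatch(urls []string, timeout time.Duration) int {
	client := &http.Client{Timeout: timeout}

	var (
		wg           sync.WaitGroup
		mu           sync.Mutex
		inaccessible int
	)

	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			resp, err := client.Head(u) // a HEAD request is an assumption here
			if err != nil {
				mu.Lock()
				inaccessible++
				mu.Unlock()
				return
			}
			resp.Body.Close()
			if resp.StatusCode >= 400 {
				mu.Lock()
				inaccessible++
				mu.Unlock()
			}
		}(u)
	}
	wg.Wait()
	return inaccessible
}

func main() {
	batch := []string{"https://www.google.com", "https://example.com"}
	// 10 seconds matches OUT_GOING_URL_ACCESSIBILITY_CHECK_TIMEOUT in the .env example.
	fmt.Println("inaccessible:", checkURLBatch(batch, 10*time.Second))
}
```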
- Go language
- Gin framework
- Monkey mock
- Httpmock
- Testify
- Cors
- Docker
- Docker compose
- Github actions
- Go vet
- ReactJs
- TypeScript
- Material UI
- Docker
- Docker compose
- Github actions
Due to time constraints, unit tests are available only in the Scraper API (Go lang) project.
1. Scraper API - GitHub repo
- Go lang
- Clone the GitHub repo using the `git clone git@github.com:go-scraper/scraper-api.git` command.
- Open the command line and navigate to the `scraper-api` dir.
- Run the `go mod tidy` command to install all dependencies listed in the `go.mod` file.
- To run unit tests with the coverage output, run the `go test -coverprofile=coverage.out ./...` command.
- To view the test coverage, run the `go tool cover -func=coverage.out` command.
[Figure 03 - Test coverage output]
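For reference, a unit test in this stack would typically combine testify assertions with httpmock to stub outgoing HTTP calls, as in the minimal sketch below. The package, test, and URL names are illustrative only and do not correspond to actual files in the repo.

```go
package services_test

import (
	"net/http"
	"testing"

	"github.com/jarcoal/httpmock"
	"github.com/stretchr/testify/assert"
)

// TestURLStatusCheck stubs an outgoing request and asserts on the returned status.
func TestURLStatusCheck(t *testing.T) {
	httpmock.Activate()
	defer httpmock.DeactivateAndReset()

	// Any GET to this URL is answered locally with a 200 response.
	httpmock.RegisterResponder("GET", "https://example.com",
		httpmock.NewStringResponder(200, "<html></html>"))

	resp, err := http.Get("https://example.com")
	assert.NoError(t, err)
	defer resp.Body.Close()
	assert.Equal(t, 200, resp.StatusCode)
}
```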
Due to time constraints, lint formatting checks are available only in the Scraper API (Go lang) project.
1. Scraper API - GitHub repo
- Go lang
- Set the `GOPATH` env variable with the `<go-installed-dir>/golang/go` value.
- Run the `./golangci-lint run` command.
- Maximum allowed line length is 100 chars.
You can change configurations in the `.env` file of each project, scraper-api and scraper-client.
1. Scraper API - GitHub repo/.env
# Application port
APP_PORT=8080
# Page size for the URL status check
URL_STATUS_CHECK_PAGE_SIZE=10
# Outgoing scrape request timeout
OUT_GOING_SCRAPE_REQ_TIMEOUT=30 # in seconds
# Outgoing URL accessibility check timeout
OUT_GOING_URL_ACCESSIBILITY_CHECK_TIMEOUT=10 # in seconds
2. Scraper Client - GitHub repo/.env
# Application port
PORT=3000
# Scraper API base URL
REACT_APP_SCRAPER_API_BASE_URL=http://localhost:8080
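On the API side, these values are plain environment variables, so they can be read with the Go standard library. The snippet below is only a sketch of how that could look; the `Load` helper and its defaults are hypothetical, not the repo's actual config code.

```go
package config

import (
	"os"
	"strconv"
	"time"
)

// getInt reads an integer env variable, falling back to def when unset or invalid.
func getInt(key string, def int) int {
	if v, err := strconv.Atoi(os.Getenv(key)); err == nil {
		return v
	}
	return def
}

// Config mirrors the .env values shown above.
type Config struct {
	AppPort           string
	URLStatusPageSize int
	ScrapeTimeout     time.Duration
	URLCheckTimeout   time.Duration
}

// Load builds a Config from the environment, with sensible defaults.
func Load() Config {
	port := os.Getenv("APP_PORT")
	if port == "" {
		port = "8080"
	}
	return Config{
		AppPort:           port,
		URLStatusPageSize: getInt("URL_STATUS_CHECK_PAGE_SIZE", 10),
		ScrapeTimeout:     time.Duration(getInt("OUT_GOING_SCRAPE_REQ_TIMEOUT", 30)) * time.Second,
		URLCheckTimeout:   time.Duration(getInt("OUT_GOING_URL_ACCESSIBILITY_CHECK_TIMEOUT", 10)) * time.Second,
	}
}
```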
- Docker - Since both software components are Dockerized, having the Docker service up and running on your machine is enough to build and run the scraper.
To use the scraper, we have to run both the scraper client and the API.
1. Scraper API - GitHub repo
- Clone the GitHub repo using the `git clone git@github.com:go-scraper/scraper-api.git` command if you haven't already.
- Open the command line and navigate to the root folder (`scraper-api`) of the project.
- To build and run with Docker, run the `docker-compose up --build` command.
- Try sending a `GET` request using Postman or any client to the URL `http://localhost:8080/scrape?url=https://google.com`. You can pass any valid URL to the `url` query parameter.
- If the application is up and running without any errors, you should receive a response in the below format with the HTTP status code `200`.
```json
{
  "request_id": "20250104020245-hYVHLDhl",
  "pagination": {
    "page_size": 10,
    "current_page": 1,
    "total_pages": 2,
    "next_page": "/scrape/20250104020245-hYVHLDhl/2"
  },
  "scraped": {
    "html_version": "HTML 5",
    "title": "Google",
    "headings": {},
    "contains_login_form": false,
    "total_urls": 19,
    "internal_urls": 15,
    "external_urls": 4,
    "paginated": {
      "inaccessible_urls": 0,
      "urls": [
        {
          "url": "https://www.google.com/imghp?hl=en&tab=wi",
          "http_status": 200,
          "error": ""
        }
      ]
    }
  }
}
```
In case of an error, you should receive an error response in the following format with the corresponding HTTP status code.
```json
{
  "error": "Error message"
}
```
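If you prefer code over Postman, the documented response shapes above can be consumed with a few structs, as in the hedged sketch below. The struct fields are trimmed to the keys shown in this README; the real entity models in the repo may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// scrapeResponse mirrors the success payload documented above (trimmed).
type scrapeResponse struct {
	RequestID  string `json:"request_id"`
	Pagination struct {
		PageSize    int    `json:"page_size"`
		CurrentPage int    `json:"current_page"`
		TotalPages  int    `json:"total_pages"`
		NextPage    string `json:"next_page"`
	} `json:"pagination"`
	Scraped struct {
		HTMLVersion       string `json:"html_version"`
		Title             string `json:"title"`
		ContainsLoginForm bool   `json:"contains_login_form"`
		TotalURLs         int    `json:"total_urls"`
		InternalURLs      int    `json:"internal_urls"`
		ExternalURLs      int    `json:"external_urls"`
	} `json:"scraped"`
}

// errorResponse mirrors the error payload documented above.
type errorResponse struct {
	Error string `json:"error"`
}

func main() {
	target := "https://google.com"
	resp, err := http.Get("http://localhost:8080/scrape?url=" + url.QueryEscape(target))
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		var e errorResponse
		_ = json.NewDecoder(resp.Body).Decode(&e)
		fmt.Println("error response:", e.Error)
		return
	}

	var s scrapeResponse
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%s: %d total URLs, next page %s\n", s.Scraped.Title, s.Scraped.TotalURLs, s.Pagination.NextPage)
}
```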
- Next, we should build and run the Scraper client application.
2. Scraper client - GitHub repo
- Clone the GitHub repo using the `git clone git@github.com:go-scraper/scraper-client.git` command if you haven't already.
- Open the command line and navigate to the root folder (`scraper-client`) of the project.
- To build and run with Docker, run the `docker-compose up --build` command.
- Access the web application in your browser using the `http://localhost:3000/` URL.
[Figure 04 - Client application screen]
- Now you can enter a URL in the input box and click the `SCRAPE` button or hit `Enter`. You should see the scraping results as in the below screenshot.
[Figure 05 - Client application screen with scraping results]
- When more than 10 hyperlinks are detected, their accessibility is checked on a page-by-page basis. Click the `ACCESS NEXT 10 OF REMAINING...` button to proceed with the next 10 hyperlinks. If inaccessible hyperlinks are detected, their count is accumulated into the URL insights data.
Added GitHub workflow YML files to both the scraper API and client repos to verify:
- Dependency download
- Unit tests
- Source build
- Docker build
- Lint formatting
Due to time constraints, unit tests and lint formatting checks are available only in the Scraper API (Go lang) project.
Defined custom loggers for the scraper-api, and the following log levels are available:
- DEBUG - The printed log line will start with `[scraper-DEBUG]`.
- INFO - The printed log line will start with `[scraper-INFO]`.
- ERROR - The printed log line will start with `[scraper-ERROR]`.
Currently we output logs to the standard output.
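One simple way to produce such prefixed log lines on standard output is with the standard library's log package, as sketched below. The actual logger implementation in the repo may be structured differently.

```go
package logger

import (
	"log"
	"os"
)

// Prefixed loggers matching the log level prefixes described above,
// all writing to standard output.
var (
	Debug = log.New(os.Stdout, "[scraper-DEBUG] ", log.LstdFlags)
	Info  = log.New(os.Stdout, "[scraper-INFO] ", log.LstdFlags)
	Error = log.New(os.Stdout, "[scraper-ERROR] ", log.LstdFlags)
)
```

For example, `logger.Info.Printf("scraping %s", url)` would print a line starting with `[scraper-INFO]`.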
We designed the backend to return meaningful errors in the error response when there is a failure. Therefore, the frontend client application can show the exact error received through the API.
- Invalid URL format
[Figure 06 - Invalid URL format error]
- Failed to reach the URL error
[Figure 07 - Failed to reach the URL error]
- Request timeout error
[Figure 08 - Request timeout error]
- Network error - This is for the connection between the scraper API and the client, i.e., if the scraper API is not reachable.
[Figure 09 - Network error]
When we check the accessibility of hyperlinks, we return the original error without manipulating it, to allow users to see the original error and make their own decisions.
[Figure 10 - An error during the accessibility check]
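Since the API is built with Gin, returning the documented `{"error": "..."}` shape is a one-liner in a handler. The sketch below is illustrative only; the validation logic and handler signature are assumptions, not the repo's actual code.

```go
package handlers

import (
	"net/http"
	"net/url"

	"github.com/gin-gonic/gin"
)

// Scrape validates the incoming URL and replies with the documented error shape on failure.
func Scrape(c *gin.Context) {
	raw := c.Query("url")
	if _, err := url.ParseRequestURI(raw); err != nil {
		// Mirrors the "Invalid URL format" error shown in Figure 06.
		c.JSON(http.StatusBadRequest, gin.H{"error": "Invalid URL format"})
		return
	}
	// On success, the real handler continues with scraping and responds with
	// the payload format shown in the "How to build and run" section.
}
```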
The below improvements/extensions were identified as potential next steps.
- Replace the in-memory storage with a database.
- Use a messaging technique to pass data changes in real-time to the UI.
- Provide a personalized dashboard to see/manage the scraping history.
- Allow users to create scheduled scraping jobs.
- Allow users to setup custom data processors and configure alerts.
- Introduce a pricing model based on supported features/provided resource limits.