
Scraper

Table of contents

  1. Introduction
  2. Software components
  3. Design challenge
  4. Technical stack
  5. How to run unit tests
  6. How to run lint formatting checks
  7. Configurations
  8. Pre-requisites to build and run
  9. How to build and run
  10. CI/CD usage
  11. Logging
  12. Error handling
  13. For next steps

Introduction

This is a scraper that can be used to submit a valid URL and scrape the following information from the website content.

  • HTML version
  • Page title
  • Number of headings and levels
  • Internal links
  • External links
  • Number of inaccessible links on the page
  • If the page contains a login form

Software components

The scraper application consists of two main software components: the Scraper API and the Scraper client.

Scraper high-level flow

[Figure 01 - Scraper high level design]

  1. Scraper API - This is the backend API service for the scraper client web application. It executes the scraping and responds to the client web application with the information found.
  2. Scraper client - This is the client-facing web application, which can be accessed through the browser. It connects to the Scraper API to execute scraping.

The Scraper API is modularized as follows (a minimal wiring sketch follows the list):

  1. Handlers
    • Scrape handler - Handles the initial scrape request.
    • Page handler - Handles subsequent pagination requests.
  2. Services
    • HTML parser - Fetches the HTML content of the given URL and processes it.
    • URL status checker - Checks the status of URLs found in the HTML content.
  3. Models
    • Entity - Models for the data transfer and storage within the application.
    • Response - Models to construct API response payloads.
  4. Storage
    • In-memory storage - Holds fetched information mapped to a random unique key.
  5. Utils
    • Helpers - Helper functions.
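
For a rough picture of how these modules fit together, the sketch below wires hypothetical handlers into Gin routes. The package layout, handler names, and signatures are assumptions for illustration only; the route shape mirrors the next_page value shown later in the example response.

package main

import "github.com/gin-gonic/gin"

// NOTE: handler names and signatures are hypothetical, shown only to
// illustrate how the handlers, services, and storage could be wired.
func main() {
    router := gin.Default()

    // Scrape handler - handles the initial scrape request.
    router.GET("/scrape", handleScrape)

    // Page handler - handles subsequent pagination requests,
    // e.g. /scrape/<request-id>/2.
    router.GET("/scrape/:requestId/:page", handlePage)

    router.Run(":8080") // APP_PORT from the .env file
}

func handleScrape(c *gin.Context) {
    // Fetch and parse the HTML, store the results in the in-memory
    // storage under a random unique key, check the first page of URLs,
    // then respond.
}

func handlePage(c *gin.Context) {
    // Look up the stored results by request ID and check the next
    // page of URLs.
}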

Scraper API high-level architecture diagram

[Figure 02 - Scraper API high level architecture diagram]

Design challenge

Requirement

After fetching the HTML content, we need to find all external and internal links in it and count the number of inaccessible links, if there are any.

Challenge

An HTML page can contain a large number of hyperlinks, and each hyperlink points to a host with some latency. This latency can range from 100 milliseconds to 10 seconds or more, depending on many factors.

If a user enters a website with hundreds of links, and our service tries to check the accessibility of every link before responding to the initial query, the user might need to wait a minute or two, or even longer, to see the response. If the user submits a website URL with thousands of hyperlinks on the same page, the system will struggle to process them all, and the user may leave due to the long waiting time. Moreover, because of the concurrent nature of the implementation and the statelessness of the API, the system would still access all hyperlinks even after the user has closed the browser.

Therefore, this can create a bottleneck in the system and can easily lead to a system failure.

Solution

We designed the scraper-api to process only 10 hyperlinks concurrently at a time. After receiving the initial scraping request, the scraper-api fetches the HTML content, collects all required information, and stores it in the in-memory storage. It then processes the first 10 hyperlinks, marks their accessibility status, and replies with the response. To process the rest of the hyperlinks, we provide a button that accesses the next 10 hyperlinks each time, and we accumulate the inaccessible hyperlink count and show it on the UI.
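
A minimal sketch of the batching idea, assuming a plain WaitGroup over one page of at most 10 URLs; the helper below is illustrative and not the exact implementation in scraper-api.

package services

import (
    "net/http"
    "sync"
    "time"
)

// checkURLPage checks one page of hyperlinks (the caller passes at most
// URL_STATUS_CHECK_PAGE_SIZE, i.e. 10, URLs) concurrently and returns the
// number of inaccessible links. Illustrative sketch only.
func checkURLPage(urls []string) int {
    client := &http.Client{Timeout: 10 * time.Second} // OUT_GOING_URL_ACCESSIBILITY_CHECK_TIMEOUT

    var (
        wg           sync.WaitGroup
        mu           sync.Mutex
        inaccessible int
    )

    for _, u := range urls {
        wg.Add(1)
        go func(link string) {
            defer wg.Done()
            resp, err := client.Get(link)
            if err != nil {
                mu.Lock()
                inaccessible++
                mu.Unlock()
                return
            }
            defer resp.Body.Close()
            if resp.StatusCode >= 400 {
                mu.Lock()
                inaccessible++
                mu.Unlock()
            }
        }(u)
    }
    wg.Wait()
    return inaccessible
}

With a page size of 10, the slowest link in the batch bounds the response time of each request, which is consistent with the 1-2 second figures reported below.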

With this approach, the initial response for scraping a website with around 500 hyperlinks took only 2-3 seconds, and each subsequent hyperlink processing request took only 1-2 seconds.

With this approach, the application can support websites with a considerably high number of URLs.

Limitations

  • In-memory storage - Since we use in-memory storage to hold the extracted information, this can have a negative impact when many users request scrapes of large websites. It can also lead to data loss, because the stored results disappear when the service restarts (see the sketch below).
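
A minimal sketch of such an in-memory store, assuming a mutex-guarded map keyed by the random unique request ID; the stored type and method names are illustrative.

package storage

import "sync"

// ScrapedInfo is an illustrative placeholder for the data kept per request.
type ScrapedInfo struct {
    Title string
    URLs  []string
}

// MemoryStore holds scraped results keyed by a random unique request ID.
// Because it lives in process memory, a restart loses all stored results.
type MemoryStore struct {
    mu   sync.RWMutex
    data map[string]ScrapedInfo
}

func NewMemoryStore() *MemoryStore {
    return &MemoryStore{data: make(map[string]ScrapedInfo)}
}

func (s *MemoryStore) Put(id string, info ScrapedInfo) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.data[id] = info
}

func (s *MemoryStore) Get(id string) (ScrapedInfo, bool) {
    s.mu.RLock()
    defer s.mu.RUnlock()
    info, ok := s.data[id]
    return info, ok
}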

Technical stack

Scraper API

  1. Go
  2. Gin framework
  3. Monkey mock
  4. httpmock
  5. Testify
  6. CORS
  7. Docker
  8. Docker Compose
  9. GitHub Actions
  10. go vet

Scraper client

  1. ReactJS
  2. TypeScript
  3. Material UI
  4. Docker
  5. Docker Compose
  6. GitHub Actions

How to run unit tests

Due to time constraints, unit tests are available only in the Scraper API (Go) project.

1. Scraper API - GitHub repo

Pre-requisites

  • Go

Steps

  1. Clone the GitHub repo using the git clone git@github.com:go-scraper/scraper-api.git command.
  2. Open the command line and navigate to the scraper-api directory.
  3. Run the go mod tidy command to install all dependencies listed in the go.mod file.
  4. To run unit tests with coverage output, run the go test -coverprofile=coverage.out ./... command (a minimal example of the test style is shown after these steps).
  5. To view the test coverage, run the go tool cover -func=coverage.out command.
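
For reference, a minimal example of a test in the Testify style used by the project; the helper function and its package are hypothetical and exist only to keep the example self-contained.

package utils

import (
    "strings"
    "testing"

    "github.com/stretchr/testify/assert"
)

// isExternal is a hypothetical helper, used only for this example.
func isExternal(baseHost, href string) bool {
    return strings.HasPrefix(href, "http") && !strings.Contains(href, baseHost)
}

// Place tests in a *_test.go file so `go test ./...` picks them up.
func TestIsExternal(t *testing.T) {
    assert.True(t, isExternal("google.com", "https://example.com/page"))
    assert.False(t, isExternal("google.com", "https://www.google.com/imghp"))
    assert.False(t, isExternal("google.com", "/relative/path"))
}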

Test coverage output

[Figure 03 - Test coverage output]

How to run lint formatting checks

Due to time constraints, lint formatting checks are available only in the Scraper API (Go) project.

1. Scraper API - GitHub repo

Pre-requisites

  • Go
  • GOPATH environment variable set to <go-installed-dir>/golang/go.

Steps

  1. Run the ./golangci-lint run command.
  • The maximum allowed line length is 100 characters.

Configurations

You can change configurations in the .env file of each of scraper-api and scraper-client; a sketch of how the API might read these values follows the examples below.

1. Scraper API - GitHub repo/.env

# Application port
APP_PORT=8080

# Page size for the URL status check
URL_STATUS_CHECK_PAGE_SIZE=10

# Outgoing scrape request timeout
OUT_GOING_SCRAPE_REQ_TIMEOUT=30 # in seconds

# Outgoing URL accessibility check timeout
OUT_GOING_URL_ACCESSIBILITY_CHECK_TIMEOUT=10 # in seconds

2. Scraper Client - GitHub repo/.env

# Application port
PORT=3000

# Scraper API base URL
REACT_APP_SCRAPER_API_BASE_URL=http://localhost:8080
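
As an illustration of how the scraper-api values above might be consumed, the sketch below reads them with os.Getenv and falls back to the defaults shown; the loader itself is an assumption, not the repo's actual config code.

package config

import (
    "os"
    "strconv"
    "time"
)

// Config mirrors the scraper-api .env values shown above.
// Illustrative sketch, not the project's actual loader.
type Config struct {
    AppPort                string
    URLStatusCheckPageSize int
    ScrapeRequestTimeout   time.Duration
    URLAccessCheckTimeout  time.Duration
}

func Load() Config {
    return Config{
        AppPort:                getEnv("APP_PORT", "8080"),
        URLStatusCheckPageSize: getEnvInt("URL_STATUS_CHECK_PAGE_SIZE", 10),
        ScrapeRequestTimeout:   time.Duration(getEnvInt("OUT_GOING_SCRAPE_REQ_TIMEOUT", 30)) * time.Second,
        URLAccessCheckTimeout:  time.Duration(getEnvInt("OUT_GOING_URL_ACCESSIBILITY_CHECK_TIMEOUT", 10)) * time.Second,
    }
}

// getEnv returns the env value or a fallback when the variable is unset.
func getEnv(key, fallback string) string {
    if v := os.Getenv(key); v != "" {
        return v
    }
    return fallback
}

// getEnvInt is like getEnv but parses the value as an integer.
func getEnvInt(key string, fallback int) int {
    if v := os.Getenv(key); v != "" {
        if n, err := strconv.Atoi(v); err == nil {
            return n
        }
    }
    return fallback
}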

Pre-requisites to build and run

  • Docker - Since both software components are Dockerized, having the Docker service up and running on your machine is enough to build and run the scraper.

How to build and run

To use the scraper, we have to run both the Scraper client and the Scraper API.

1. Scraper API - GitHub repo

  1. Clone the GitHub repo using the git clone git@github.com:go-scraper/scraper-api.git command if you haven't already.
  2. Open the command line and navigate to the root folder (scraper-api) of the project.
  3. To build and run with Docker, run the docker-compose up --build command.
  4. Try sending a GET request using Postman or any other client to the URL http://localhost:8080/scrape?url=https://google.com. You can pass any valid URL as the url query parameter (a small Go snippet for trying the endpoint is shown after the response examples below).
  5. If the application is up and running without any errors, you should receive a response in the below format with the HTTP status code 200.
{
    "request_id": "20250104020245-hYVHLDhl",
    "pagination": {
        "page_size": 10,
        "current_page": 1,
        "total_pages": 2,
        "next_page": "/scrape/20250104020245-hYVHLDhl/2"
    },
    "scraped": {
        "html_version": "HTML 5",
        "title": "Google",
        "headings": {},
        "contains_login_form": false,
        "total_urls": 19,
        "internal_urls": 15,
        "external_urls": 4,
        "paginated": {
            "inaccessible_urls": 0,
            "urls": [
                {
                    "url": "https://www.google.com/imghp?hl=en&tab=wi",
                    "http_status": 200,
                    "error": ""
                },
            ]
        }
    }
}

In case of an error, you should receive an error response in the following format with the corresponding HTTP status code.

{
    "error": "Error message"
}
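
If you prefer code over Postman, a small Go snippet like the one below can call the endpoint and print either the scraped summary or the error field; the struct maps only a few of the fields shown above.

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

// scrapeResponse maps only a subset of the response fields shown above.
type scrapeResponse struct {
    RequestID string `json:"request_id"`
    Error     string `json:"error"`
    Scraped   struct {
        Title     string `json:"title"`
        TotalURLs int    `json:"total_urls"`
    } `json:"scraped"`
}

func main() {
    resp, err := http.Get("http://localhost:8080/scrape?url=https://google.com")
    if err != nil {
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()

    var body scrapeResponse
    if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
        fmt.Println("decode failed:", err)
        return
    }

    if resp.StatusCode != http.StatusOK {
        // Non-200 responses carry the {"error": "..."} payload.
        fmt.Println("error:", body.Error)
        return
    }
    fmt.Printf("request %s: %q has %d URLs\n", body.RequestID, body.Scraped.Title, body.Scraped.TotalURLs)
}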
  6. Next, we should build and run the Scraper client application.

2. Scraper client - GitHub repo

  1. Clone the GitHub repo using the git clone git@github.com:go-scraper/scraper-client.git command if you haven't already.
  2. Open the command line and navigate to the root folder (scraper-client) of the project.
  3. To build and run with Docker, run the docker-compose up --build command.
  4. Access the web application in your browser at http://localhost:3000/.

Client empty screen

[Figure 04 - Client application screen]

  5. Now you can enter a URL in the input box and click the SCRAPE button or hit Enter. You should see the scraping results as in the screenshot below.

Client result screen

[Figure 05 - Client application screen with scraping results]

  6. When more than 10 hyperlinks are detected, their accessibility is checked on a per-page basis. Click the ACCESS NEXT 10 OF REMAINING... button to proceed with the next 10 hyperlinks. If inaccessible hyperlinks are detected, the count is accumulated into the URL insights data.

CI/CD usage

GitHub workflow YAML files were added to both the Scraper API and client repos to verify:

  • Dependency download
  • Unit tests
  • Source build
  • Docker build
  • Lint formatting

Due to time constraints, unit tests and lint formatting checks are available only in the Scraper API (Go) project.

Logging

Custom loggers are defined for the scraper-api, and the following log levels are available.

  1. DEBUG - The printed log line will start with [scraper-DEBUG]
  2. INFO - The printed log line will start with [scraper-INFO]
  3. ERROR - The printed log line will start with [scraper-ERROR]

Currently we output logs to the standard output.
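
A minimal sketch of how such prefixed loggers can be built with the standard library; the actual custom logger in scraper-api may differ.

package logger

import (
    "log"
    "os"
)

// Prefixed loggers writing to standard output, matching the
// [scraper-DEBUG] / [scraper-INFO] / [scraper-ERROR] prefixes above.
// Illustrative sketch only.
var (
    Debug = log.New(os.Stdout, "[scraper-DEBUG] ", log.LstdFlags)
    Info  = log.New(os.Stdout, "[scraper-INFO] ", log.LstdFlags)
    Error = log.New(os.Stdout, "[scraper-ERROR] ", log.LstdFlags)
)

// Usage example: logger.Info.Println("scraping started")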

Error handling

We designed the backend to return meaningful errors in the error response when there is a failure. Therefore, the frontend client application can show the exact error received through the API.
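
A minimal sketch of the idea, assuming Gin (listed in the technical stack); the validation, status code, and message are illustrative rather than the exact handler code.

package handlers

import (
    "net/http"
    "net/url"

    "github.com/gin-gonic/gin"
)

// HandleScrape sketches how a meaningful error is returned in the
// {"error": "..."} format described above. Illustrative only.
func HandleScrape(c *gin.Context) {
    rawURL := c.Query("url")

    if _, err := url.ParseRequestURI(rawURL); err != nil {
        // The client application can display this message as-is.
        c.JSON(http.StatusBadRequest, gin.H{"error": "Invalid URL format"})
        return
    }

    // ... fetch, parse, and respond with the scraped payload and HTTP 200 ...
}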

Errors on scraping the given URL

  1. Invalid URL format

Invalid URL format error

[Figure 06 - Invalid URL format error]

  2. Failed to reach the URL error

Failed to reach the URL error

[Figure 07 - Failed to reach the URL error]

  3. Request timeout error

Request timeout error

[Figure 08 - Request timeout error]

  4. Network error - This refers to the connection between the scraper API and the client, i.e., if the scraper API is not reachable.

Network error

[Figure 09 - Network error]

Errors on accessibility check of hyperlinks

When we check the accessibility of hyperlinks, we return the original error without manipulation, so that the user can see the original error and make decisions accordingly.

Accessibility check error

[Figure 10 - An error during the accessibility check]

For next steps

The following improvements/extensions have been identified as potential next steps.

  • Replace the in-memory storage with a database.
  • Use a messaging technique to pass data changes in real-time to the UI.
  • Provide a personalized dashboard to see/manage the scraping history.
  • Allow users to create scheduled scraping jobs.
  • Allow users to set up custom data processors and configure alerts.
  • Introduce a pricing model based on supported features/provided resource limits.
