Repository for the scrapers for the Result Assessment Tool (RAT)

mmoershe/rat-scrapers

 
 

Scraper

The Scraper application is designed to collect search results from various information retrieval systems or search engines.

Setting Up Scrapers

To install the dependencies for both applications on one server:

  • Install Python packages from the requirements.txt in the root folder:
python -m pip install --no-cache-dir -r requirements.txt
  • The Scraper application provides a framework for scraping search results from different sources. Because adding a new scraper can be complex, start from the provided template.

  • Define your scraper using the template located at /scrapers/scraper/template_new_scraper.py. The template provides guidance on how to implement a new scraper.

  • Save your new scraper with the desired filename at /scraper/scrapers/your_scraper.py.

Customizing Language and Location

Customizing language and location for scrapers can be challenging. We are working on adapting the browser language in Selenium and using URL parameters to set the location for search engines.

For Google, you can specify the language and location using the URL parameters hl (interface language), gl (geolocation, given as a two-letter country code), and uule (encoded location name). Example URL: https://www.google.com/search?q=biden&hl=en&gl=US&uule=w+CAIQICImV2VzdCBOZXcgWW9yayxOZXcgSmVyc2V5LFVuaXRlZCBTdGF0ZXM%3D
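The example URL can be reproduced with a short sketch. The uule encoding used here is not an official Google API; it follows a community-documented scheme (a fixed prefix, one character encoding the length of the location name, then the Base64-encoded name), and it happens to match the example URL in this README. The function names encode_uule and google_search_url are hypothetical and not part of the repository.

```python
import base64
from urllib.parse import quote, urlencode

# Alphabet from which one character is picked to encode the name length
# (assumption: community-documented uule scheme, not an official API).
_UULE_LENGTH_KEYS = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
)


def encode_uule(canonical_name):
    """Encode a location name such as 'City,State,Country' as a uule value."""
    raw = canonical_name.encode("utf-8")
    if len(raw) >= len(_UULE_LENGTH_KEYS):
        raise ValueError("location name is too long for this encoding")
    length_key = _UULE_LENGTH_KEYS[len(raw)]
    payload = base64.b64encode(raw).decode("ascii")
    return "w+CAIQICI" + length_key + payload


def google_search_url(query, language, country, canonical_name):
    """Build a Google search URL with hl, gl and uule set."""
    params = urlencode({"q": query, "hl": language, "gl": country})
    # Keep the literal '+' of the prefix, percent-encode everything else.
    uule = quote(encode_uule(canonical_name), safe="+")
    return f"https://www.google.com/search?{params}&uule={uule}"


url = google_search_url(
    "biden", "en", "US", "West New York,New Jersey,United States"
)
print(url)
```

Whether Google honours a given uule value can change over time, so it is worth checking the returned results for the intended location.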

More information on URL parameters:

For Bing, use the parameters cc (country code) and setLang (language). Example URL: https://www.bing.com/search?q=biden&cc=us&setLang=en
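A minimal sketch of building such a Bing URL with the standard library. It assumes Bing's usual /search path and q query parameter; the function name bing_search_url is hypothetical and not part of the repository.

```python
from urllib.parse import urlencode


def bing_search_url(query, country, language):
    """Build a Bing search URL with cc (country) and setLang (language) set."""
    params = urlencode({"q": query, "cc": country, "setLang": language})
    return f"https://www.bing.com/search?{params}"


url = bing_search_url("result assessment tool", "de", "de")
print(url)
```

urlencode takes care of escaping spaces and special characters in the query, which is easy to get wrong when concatenating strings by hand.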

More information on Bing parameters:

Configuring the Selenium Driver

Update the language parameter of your Selenium Driver instance. Every scraper should include the following driver configuration:

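# Assumes SeleniumBase, whose Driver() accepts the keyword arguments used below.
from seleniumbase import Driver

# headless (bool) and ext_path (extension directory) are defined elsewhere in the scraper.
driver = Driver(
    browser="chrome",
    wire=True,
    uc=True,
    headless2=headless,
    incognito=False,
    agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    do_not_track=True,
    undetectable=True,
    extension_dir=ext_path,
    locale_code="de"  # Language code for the Driver instance
)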
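If several country versions share one scraper, the language code can be chosen per country instead of being hard-coded. The sketch below only builds the keyword arguments, so it runs without a browser. The mapping LOCALE_BY_COUNTRY and the helper driver_kwargs are hypothetical and not part of the repository.

```python
# Hypothetical mapping from the country versions listed in this README
# to two-letter language codes; extend or change it as needed.
LOCALE_BY_COUNTRY = {
    "Germany": "de",
    "France": "fr",
    "Italy": "it",
    "Poland": "pl",
    "Sweden": "sv",
    "USA": "en",
}

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
)


def driver_kwargs(country, headless=True, ext_path=None):
    """Return the driver configuration above with the locale set per country."""
    return {
        "browser": "chrome",
        "wire": True,
        "uc": True,
        "headless2": headless,
        "incognito": False,
        "agent": USER_AGENT,
        "do_not_track": True,
        "undetectable": True,
        "extension_dir": ext_path,
        "locale_code": LOCALE_BY_COUNTRY[country],
    }


french_options = driver_kwargs("France")
# In a scraper, these would be passed on as: driver = Driver(**french_options)
print(french_options["locale_code"])
```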

Available Jupyter Notebooks

We also offer a Jupyter Notebook for setting up and testing a new scraper directly, available at /templates/new_scraper.ipynb.

Test Scripts for Scrapers

  • test_scraper.py: Use this script to test individual scrapers, including any newly added ones.

Adding New Scrapers to the Main Software

Requirements for adding new scrapers:

  • Register the new scraper in the searchengine table. Each entry has the following fields:

    Field         Content
    Name          Search engine name
    Module        Python file with the scraper (e.g., your_scraper.py)
    Result Type   Foreign key to the resulttype table (e.g., 1 = organic results)
    Country       Foreign key to the country table (e.g., 1 = Germany)
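As an illustration of the registration step, the sketch below inserts one entry into a searchengine table. It uses an in-memory SQLite database as a stand-in for the real database, and the column names (name, module, resulttype, country) are assumptions derived from the field names above, not the actual schema.

```python
import sqlite3

# In-memory SQLite stands in for the real database; the column names
# are hypothetical and only mirror the fields described above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE searchengine ("
    "id INTEGER PRIMARY KEY, name TEXT, module TEXT, "
    "resulttype INTEGER, country INTEGER)"
)


def register_scraper(conn, name, module, resulttype, country):
    """Insert one scraper into the searchengine table and return its row id."""
    cursor = conn.execute(
        "INSERT INTO searchengine (name, module, resulttype, country) "
        "VALUES (?, ?, ?, ?)",
        (name, module, resulttype, country),
    )
    conn.commit()
    return cursor.lastrowid


# 1 = organic results and 1 = Germany, following the examples above.
new_id = register_scraper(conn, "My Search Engine", "your_scraper.py", 1, 1)
row = conn.execute(
    "SELECT name, module, resulttype, country FROM searchengine WHERE id = ?",
    (new_id,),
).fetchone()
print(row)
```

The parameterized query (the ? placeholders) avoids quoting problems in search engine names; the same pattern applies to other database drivers, although their placeholder syntax may differ.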

List of Available Scrapers:

  • Microsoft Bing (Country versions: Germany, France, Italy, Poland, Sweden, USA) - Developed by Sebastian Sünkler
  • Google (Country versions: Germany, France, Italy, Poland, Sweden, USA) - Developed by Sebastian Sünkler
  • Google Video (Country versions: Germany, France, Italy, Poland, Sweden, USA) - Developed by Sebastian Sünkler
  • DuckDuckGo (Country versions: Germany, France, Italy, Poland, Sweden, USA) - Developed by Sebastian Sünkler
  • Ecosia (Country versions: Germany, USA) - Developed by Sebastian Sünkler
  • Katalogplus (Country versions: Germany) - Developed by Sebastian Sünkler
  • Brave (Country versions: Germany) - Developed by Sophia Bosnak
  • Dogpile (Country versions: Germany) - Developed by Sophia Bosnak
  • Econbiz (Country versions: Germany) - Developed by Sophia Bosnak
  • Fireball (Country versions: Germany) - Developed by Sophia Bosnak
  • Lycos (Country versions: Germany) - Developed by Sophia Bosnak
  • Mojeek (Country versions: Germany) - Developed by Sophia Bosnak
