Web scraping

Scraping Fish, shot-scraper (example), and Colly are neat.

Currently exploring Playwright together with AutoScraper for my scraping needs.

Links

Scrapy - Fast high-level web crawling & scraping framework for Python. (Web) (Docs) (Awesome Scrapy) (Random proxy middleware)
Scrapyd - Service for running Scrapy spiders. (Docs)
ScrapydWeb - Web app for Scrapyd cluster management, Scrapy log analysis & visualization, Auto packaging, Timer tasks, Monitor & Alert, and Mobile UI.
Simple Scraper - Extract data from any website in seconds.
ScrapingBee - Web Scraping API.
Easy web scraping with Scrapy (2019)
A guide to Web Scraping without getting blocked in 2020
Crawlab - Distributed web crawler admin platform for spiders management regardless of languages and frameworks.
hakrawler - Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application.
JobFunnel - Tool for scraping job websites, and filtering and reviewing the job listings.
You-Get - Tiny command-line utility to download media contents (videos, audios, images) from the Web.
Universal Reddit Scraper - Scrape Subreddits, Redditors, and comments on posts. A command-line tool written in Python.
Gerapy - Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js.
Ask HN: Best practices for ethical web scraping? (2020)
Newscatcher - Programmatically collect normalized news from (almost) any website. (Code)
scrapio - Simple and easy-to-use scraper and crawler in Go.
Colly - Elegant Scraper and Crawler Framework for Go. (Tutorial)
Python Web Scraping with Virtual Private Networks (2020)
extract-news-api - Flask code to deploy an API that pulls structured data from online news articles.
Web Scraper - Scrape websites for text by CSS selector.
List all the broken links on your website
Creating a Robust, Reusable Link-Checker (2020)
micawber - Small library for extracting rich content from urls.
Spider Pro - Easy and cheap way to scrape the internet. (HN)
Website Sitemap Parser
rget - Download URLs and verify the contents against a publicly recorded cryptographic log.
yarl - Yet another URL library.
Apify - Web Scraping, Data Extraction and Automation.
Gumbo - Pure-C HTML5 parser.
What is a present-day web scraping in 2020?
Dataflow Kit - Web scraping. Data extraction tools
Awesome Web Scraping
Common Crawl - Open repository of web crawl data that can be accessed and analyzed by anyone. (HN)
Analysing Petabytes of Websites using Common Crawl (2017)
Cognito Common Crawl - Search the common crawl using lambda functions.
Awesome Open Source Javascript Projects for Web Scraping (2020)
ScrapingAnt - All in One Scraping API. Rotating Proxies. Headless Chrome.
Django Dynamic Scraper - Creating Scrapy scrapers via the Django admin interface.
AutoScraper - Smart, Automatic, Fast and Lightweight Web Scraper for Python.
Spidey - Dead-simple crawler which focuses on ease of use and speed. Returns a list of all URLs of a web page.
Scraping News and Articles From Public APIs with Python (2020)
LinkedIn Scraper
ScrapeOwl - Simple and affordable web scraping API.
Pholcidae - Tiny python web crawler.
Booking site web scraper - Downloads all of the accommodations for the chosen country and saves them in a file.
Reddit Media Downloader - Scrapes Reddit to download media of your choice.
Web scraping with JS (2020) (HN)
Web scraping that just works with OpenFaaS with Puppeteer (2020)
What Happened to XPath? (2020) (HN)
ScrapingHub - Turn web content into useful data. (GitHub)
extruct - Library for extracting embedded metadata from HTML markup.
Introduction to Scraping in Python (2020)
Test driving a HackerNews scraper with Node.js (2020)
SecretAgent - Web browser that's built for scraping. (Web)
Ulixee - Turns every website into an open API. Access any dataset on the world wide web. (GitHub)
Floki - Simple HTML parser that enables search for nodes using CSS selectors.
NYT Vote Scraper - Scrapes the NYT Votes Remaining Page JSON and commits it back to this repo. Nice use of GitHub Actions for git scraping.
Instagram Scraper - Scrapes an Instagram user's photos and videos.
Inventory Hunter - Get notified as soon as your next CPU, GPU, or game console is in stock.
Guide on preventing Website Scraping
Bibliographies of the Bibliometric-enhanced Information Retrieval workshops and related other workshops
news-please - Open source, easy-to-use news crawler that extracts structured information from almost any news website.
Web crawling with Python (2020)
Metascraper - Scrape data from websites using Open Graph, HTML metadata & fallbacks. (Docs)
Instaloader - Download pictures (or videos) along with their captions and other metadata from Instagram. (Docs)
trafilatura - Manage URLs and scrape main text and metadata.
Go-Trafilatura - Go package and command-line tool which seamlessly downloads, parses, and scrapes web page data.
htmldate - Find the publication date of web pages.
Filtering links to gather texts on the web (2020)
Evaluating scraping and text extraction tools for Python (2020)
Using sitemaps to crawl websites (2019)
Evaluation of date extraction tools for Python (2020)
jusText - Tool for removing boilerplate content, such as navigation links, headers, and footers from HTML pages.
sumy - Module for automatic summarization of text documents and HTML pages.
Voyager - Write your own web crawler/scraper as a state machine in rust.
Trandoshan - Fast, highly configurable, cloud native dark web crawler.
ralger - Makes it easy to scrape a website with R.
Scraping HN content with declarative programming
snscrape - Social networking service scraper in Python. (Fork)
qwarc - Framework for rapidly archiving a large number of URLs with little overhead.
select.rs - Rust library to extract useful data from HTML documents, suitable for web scraping.
Scrapera - Provides access to a variety of scraper scripts for most commonly used machine learning and data science domains.
Visual scraping with Elixir and Crawly (2021)
Headless Chrome Crawler - Distributed crawler powered by Headless Chrome.
Tips for reliable web automation and scraping selectors (2021) (HN)
Web Crawler for scraping Financial data (Article)
Web Scraping 101 with Python (2021) (HN) (HN)
Automatio - No-code Web Automation Tool. Automation Tool to Extract Data From Any Website.
Scaling up a Serverless Web Crawler and Search Engine (2021)
crawler-user-agents - List of HTTP user-agents used by robots, crawlers, and spiders in a single JSON file.
ant - Web crawler for Go.
SearchScraperAPI - Implementation of an API, which allows you to scrape Google, Bing, Yandex, and Qwant.
Scala Scraper - Scala library for scraping content from HTML pages.
Next.js Web Scraper Playground - Build and test your own web scraper APIs with Next.js API Routes and cheerio. (Web)
Scrapers List
Trafilatura - Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments).
Rarchy - Visual Sitemaps & Website Planning Tool. (HN)
CloudProxy - Hide your scraper's IP behind the cloud. (HN)
FlareSolverr - Proxy server to bypass Cloudflare protection.
Schema API for the Semantic Web - Extract structured content from the semantic web.
DataHen Till - Standalone tool that runs alongside your web scraper, and instantly makes your existing web scraper scalable, maintainable and unblockable. (Web) (HN)
Mastering Web Scraping in Python: Crawling from Scratch (2021) (HN)
Data-Mining Wikipedia for Fun and Profit (2021) (HN)
Wikidata or Scraping Wikipedia (HN)
pyspider - Powerful Spider (Web Crawler) System in Python. (Docs)
Python-Goose - HTML Content / Article Extractor, web scraping library in Python.
Dyer - Designed for reliable, flexible and fast web crawling, providing some high-level, comprehensive features without compromising speed.
How to Crawl the Web with Scrapy (2021) (HN)
PageMetaScraper - Page metadata scraper with several fallback strategies.
cariddi - Take a list of domains, crawl URLs and scan for endpoints, secrets, API keys, file extensions, tokens and more.
Super-Simple Scraper - Crawler/scraper based on Go + colly, configurable via JSON.
Gospider - Fast web spider written in Go.
The State Of Web Scraping in 2021 (HN)
trafilatura - Web scraping tool for text discovery and retrieval.
scrapy.js - Web Scraping library for JavaScript built using BeautifulSoup4.
PHP Goose - Readability / HTML Content / Article Extractor & Web Scraping library written in PHP.
Web scraping by watching requests (2021)
Effortless Crawling with Scrapy with one method (2021)
Avoiding bot detection: How to scrape the web without getting blocked?
crawley - Crawls web pages and prints any link it can find.
grab-site - Archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns.
cloudscraper - Python module to bypass Cloudflare's anti-bot page.
Papercut - Scraping/crawling library for Node.js, written in Typescript.
Marple - Collect links to profiles by username through search engines.
Web Scraping with Go (2021) (Reddit)
Maigret - Collect a dossier on a person by username from thousands of sites.
Notes on Writing Web Scrapers (2021)
Scraping Websites With Logins (2021) (Reddit)
Skan.jl - Scan web pages for changes using Julia & GitHub Actions.
cloudflare-scraper - Package to bypass Cloudflare's protection.
scrapy-poet - Page Object pattern for Scrapy.
Go Download Web - Download an entire website with Go.
linkcheck - Fast link checker.
scrapli - Fast, flexible, sync/async, Python 3.6+ screen scraping client specifically for network devices.
scrapligo - scrapli, but in go.
waybacked - Get URLs from the Wayback Machine. Able to handle large outputs.
changedetection.io - Self-Hosted, Open Source, Change Monitoring of Web Pages.
Jiu - Detect new images and video on social media feeds and dispatch webhooks on updates.
Building a scalable scraper in Rust (2021)
Instagram Scraper - Allows you to scrape posts from a user's profile page, hashtag page, or place.
Scraping without JavaScript using Chromium on AWS Lambda: The Novel (2022)
The State of Web Scraping 2022 (HN)
Chrome File Downloader - Go library for scraping or downloading files bypassing Cloudflare protection and browser checks.
Mechaml - OCaml functional web scraping library.
WikiDump Indexer and Search - Wikipedia dump parser and indexer with search functionality. Made for Information Retrieval and Extraction course.
Xidel - Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching.
web-poet - Web scraping Page Objects core library.
Are Product Hunt's featured products still online today? (2022) (HN)
html2data - Library and cli for extracting data from HTML via CSS selectors.
Hyperlink - Detect invalid and inefficient links on your webpages. Works with local files or websites, on the command line and as a node library.
requests-ip-rotator - Python library to utilize AWS API Gateway's large IP pool as a proxy to generate pseudo-infinite IPs for web scraping and brute forcing.
Pinterest Web Scraper - Scraping Visually Similar Images from Pinterest.
gazpacho - Simple, fast, and modern web scraping library. (Docs)
Hitomi Downloader - Desktop utility to download images/videos/music/text from various websites, and more.
Pinterest Downloader - Download all images/videos from Pinterest user/board/section.
More notes on writing web scrapers (2022) (HN)
scraperlite - Scrape text and HTML based on CSS selectors and save contents to a SQLite database.
Browsertrix Crawler - Run a high-fidelity browser-based crawler in a single Docker container.
pafy - Python library to download YouTube content and retrieve metadata.
So you want to Scrape like the Big Boys? (2021)
Dude - Simple framework for writing a web scraper using Python decorators.
myfaveTT - Download all your TikTok Likes. (HN)
Scraping web pages from the command line with shot-scraper (2022) (HN)
Apify SDK - Scalable web crawling and scraping library for JavaScript.
Extracting web page content using Readability.js and shot-scraper (2022)
Texting Robots: Taming robots.txt with Rust and 34 million tests (2022) (Reddit)
Scraping Instagram (2022) (HN)
Linkedin Scraper - Scrapes LinkedIn user data.
Aeon - Scan the internet for your personal information and modify or remove it.
article-parser - Extract main article, main image and meta data from URL.
WebParsy - Node.JS library and cli for scraping websites using Puppeteer (or not) and YAML definitions.
Hext - Domain-specific language for extracting structured data from HTML documents.
AutoScrape - Automated, programming-free web scraper for interactive sites.
Portia - Tool that allows you to visually scrape websites without any programming knowledge required.
Surgeon - Declarative DOM extraction expression evaluator.
Ayakashi - Next generation web scraping framework.
CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data (2019) (Code)
How To Use HTMLRewriter for Web Scraping (2022)
Brozzler - Distributed browser-based web crawler.
oEmbed Parser - Extract oEmbed data from given webpage.
Proxy scraper and checker - Scrape more than 1K HTTP proxies in less than 2 seconds.
Toutatis - Tool that allows you to extract information from Instagram accounts such as e-mails, phone numbers and more.
Crawl Original Google Images & Youtube Videos
OnlyFans DataScraper - Scrape all the media from an OnlyFans account.
Shot Scraper Template - Quickly create a new GitHub repository that takes automated screenshots of a web page using shot-scraper.
Web Scraping via JavaScript Runtime Heap Snapshots (2022) (HN)
All the Places - Set of spiders and scrapers to extract location information from places that post their location on the internet.
Spider - Multithreaded Web spider crawler written in Rust.
Scrapism 2022 course
Libextract - Extract data from websites using basic statistical magic.
TikTok Scraper & Downloader - Download video posts, collect user/trend/hashtag/music feed metadata, sign URLs, etc.
Scraping Airbnb (2022)
Shears - Functional web scraping in TS.
Web scraping with Python open knowledge (HN)
Web scraping Proxy Library for Scrapy (HN)
SLRP - Rotating open proxy multiplexer.
