Ansumanbhujabal/Linkedin_Scraper


LinkedIn Scraper

This project is a LinkedIn scraper built with Python, Selenium, and Redis. It logs into LinkedIn and scrapes a profile's posts and comments, along with metadata such as like and comment counts. Profile URLs extracted from the comments are stored in a Redis queue for further processing.

Note: This project is a work in progress. Some features are not yet fully implemented, including scraping 500 profiles and the complete Redis functionality.
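As a rough illustration of the data flow described above, here is a minimal sketch that pulls profile URLs out of comment HTML and writes scraped posts to a JSON file. The `Post` fields, the function names, and the URL pattern are assumptions made for this example and are not taken from the project's code.

```python
import json
import re
from dataclasses import asdict, dataclass, field
from pathlib import Path

# Assumed pattern: profile links of the form https://www.linkedin.com/in/<slug>.
PROFILE_URL_RE = re.compile(r"https://www\.linkedin\.com/in/[A-Za-z0-9\-_%]+")


@dataclass
class Post:
    """Hypothetical record shape; the real field names may differ."""
    text: str
    date: str
    likes: int
    comments_count: int
    commenter_profiles: list = field(default_factory=list)


def extract_profile_urls(html: str) -> list:
    """Return unique profile URLs found in an HTML fragment, in first-seen order."""
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(PROFILE_URL_RE.findall(html)))


def save_posts(posts: list, path: Path) -> None:
    """Write the posts to a JSON file as a list of objects."""
    payload = [asdict(post) for post in posts]
    path.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
```

Query strings and trailing slashes are dropped by the pattern, so two links to the same profile collapse into one entry.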

Architecture

[Three architecture screenshots, dated 2024-10-23; images not available in this text.]

Output

[Two output screenshots, dated 2024-11-06 and 2024-10-23; images not available in this text.]

Features

  • Logs into LinkedIn using provided credentials.
  • Scrolls through the infinitely loading feed to collect all posts from a profile.
  • Scrapes post text, post date, like count, comment count, and more.
  • Extracts profile URLs from the comments section of each post.
  • Stores profile URLs in Redis for further processing.
  • Saves scraped data in JSON files.
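The infinite-scroll feature can be sketched roughly as follows. This is not the project's implementation, just one common way to do it with Selenium's `execute_script`: scroll to the bottom, wait, and stop once the page height no longer grows. The function name and the `pause_seconds` and `max_rounds` parameters are assumptions for this example.

```python
import time


def scroll_to_bottom(driver, pause_seconds: float = 2.0, max_rounds: int = 50) -> int:
    """Scroll until the page height stops growing; return the number of scrolls done."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause_seconds)  # give lazily loaded posts time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content appeared, assume the end of the feed
        last_height = new_height
    return rounds
```

The `max_rounds` cap keeps the loop from running indefinitely on very long feeds. A fixed pause is a simplification and may need tuning for slow connections.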

Limitations and Future Work

  • LinkedIn Account Restrictions: The scraper has not been tested on 500 profiles, because a run of that size risks triggering LinkedIn's security mechanisms and getting the account blocked.
  • Google Captcha / Network Blocking: The scraper may run into Google captchas, and your IP may be blocked from continuing. This is a potential roadblock when scaling up the number of profiles scraped.
  • IP Rotation: Scaling this project effectively would require rotating IPs or a proxy setup to avoid IP blocks and scraping limits. This is not implemented yet but is planned.
  • Redis Queue Functionality: Redis queue management is only partially implemented. URLs are stored in Redis, but queue processing, tracking, and logging for profile URLs are incomplete and need further work.
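To make the missing queue processing more concrete, here is one possible shape for it, written against the redis-py client methods `sadd`, `lpush`, `rpop`, and `hincrby`. It is a sketch under assumptions, not the project's code: the key names, `MAX_ATTEMPTS`, and the `scrape` callback are all hypothetical.

```python
# Hypothetical key names; the project may use different ones.
QUEUE_KEY = "profiles:queue"
SEEN_KEY = "profiles:seen"
ATTEMPTS_KEY = "profiles:attempts"
FAILED_KEY = "profiles:failed"
MAX_ATTEMPTS = 3

# Assumed client setup (requires the redis package and a running server):
#   import redis
#   client = redis.Redis(host="localhost", port=6379, decode_responses=True)


def enqueue_profile(client, url: str) -> bool:
    """Queue a URL once; SADD returns 0 when the URL was already seen."""
    if client.sadd(SEEN_KEY, url) == 0:
        return False
    client.lpush(QUEUE_KEY, url)
    return True


def process_next(client, scrape):
    """Pop one URL and scrape it; return the URL, or None if the queue is empty.

    On failure the URL is requeued until it has failed MAX_ATTEMPTS times,
    after which it is recorded in the failed set instead.
    """
    url = client.rpop(QUEUE_KEY)
    if url is None:
        return None
    try:
        scrape(url)
    except Exception:
        attempts = client.hincrby(ATTEMPTS_KEY, url, 1)
        if attempts < MAX_ATTEMPTS:
            client.lpush(QUEUE_KEY, url)
        else:
            client.sadd(FAILED_KEY, url)
    return url
```

The seen set prevents a profile that appears in many comment threads from being queued repeatedly, and the attempts hash gives a simple record for tracking and logging failures.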

Installation

Prerequisites

  • Python 3.9+
  • Docker
  • Redis

Steps to Run

  1. Clone the repository:

    git clone https://github.com/Ansumanbhujabal/Linkedin_Scraper.git
  2. Build the Docker image:

    docker build -t linkedin-scraper .
  3. Run the Docker container:

    docker run -d linkedin-scraper

    This launches the scraper inside a Docker container running in the background.

  4. To stop the container (run docker ps to find the container ID):

    docker stop <container_id>

Requirements

All Python dependencies are listed in requirements.txt and are installed automatically during the Docker build. They include:

  • Selenium
  • WebDriver Manager
  • Redis (Python client)

Redis

To start the Redis server locally:

redis-server

Future Improvements

  • Rotating IP Support: Implement IP rotation using proxy services to avoid network blocks from LinkedIn.
  • Complete Redis Integration: Fully implement Redis for managing profile queues and retry mechanisms for failed attempts.
  • Handling LinkedIn Limits: Implement better handling of LinkedIn's rate limits and account restrictions.
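As a starting point for the rotating IP idea, the sketch below cycles through a list of proxies in round-robin order and builds the Chromium `--proxy-server` argument for each new browser session. The `ProxyRotator` class and the proxy addresses are placeholders invented for this example; a real setup would use addresses from a proxy provider.

```python
from itertools import cycle


class ProxyRotator:
    """Hands out proxies in round-robin order (hypothetical helper)."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)

    def next_chrome_argument(self) -> str:
        """Return a --proxy-server flag for the next proxy in the rotation."""
        return f"--proxy-server={next(self._pool)}"


# Assumed usage with Selenium (placeholder addresses):
#   from selenium import webdriver
#   rotator = ProxyRotator(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])
#   options = webdriver.ChromeOptions()
#   options.add_argument(rotator.next_chrome_argument())
#   driver = webdriver.Chrome(options=options)
```

Because the proxy is set when the browser starts, rotating means opening a new driver session for each proxy. Proxies that require authentication need additional handling that this sketch does not cover.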

Disclaimer

This project is for educational purposes only. Be aware of LinkedIn's terms and conditions regarding web scraping and automated actions. Always ensure that your use of scraping tools complies with applicable terms of service.


License

Usage Restricted to Author

