This is a Python application that scrapes job postings from LinkedIn and stores them in a SQLite database. The application also provides a web interface to view the job postings and mark them as applied or hidden.
If you have spent any amount of time looking for jobs on LinkedIn, you know how frustrating it is. The same job postings keep showing up in your search results, and you have to scroll through pages and pages of irrelevant postings to find the relevant ones, only to see the ones you applied for weeks ago. This application aims to solve that problem by scraping job postings from LinkedIn and storing them in a SQLite database. You can filter out job postings based on keywords in the title and description (tired of seeing Clinical QA Manager when you search for software QA jobs? Just filter out jobs that have "clinical" in the title). The jobs are sorted by date posted, not by what LinkedIn thinks is relevant to you. No sponsored job posts. No duplicate job posts. No irrelevant job posts. Just the jobs you want to see.
If you are using this application, please be aware that LinkedIn does not allow scraping of its website. Use this application at your own risk. It's recommended to use proxy servers to avoid getting blocked by LinkedIn (more on proxy servers below).
- Python 3.6 or higher
- Flask
- Requests
- BeautifulSoup
- Pandas
- SQLite3
- Clone the repository to your local machine.
- Install the required packages using pip:
pip install -r requirements.txt
- Create a `config.json` file in the root directory of the project. See the `config.json` section below for details on the configuration options. Config_example.json is provided as an example; feel free to use it as a template.
- Run the application using the command `python app.py`.
- Open a web browser and navigate to `http://localhost:5000` to view the job postings.
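As a rough orientation, the sketch below writes a `config.json` containing the keys described in the configuration section. Every value is a placeholder assumption (proxy URLs, user agent, search terms, table names, database path), and the value types, in particular for `f_WT`, are guesses. Config_example.json in the repository remains the authoritative template, and `write_config` is a hypothetical helper that is not part of the project.

```python
import json

# Placeholder values only; adjust every entry to your own setup.
example_config = {
    "proxies": {
        "http": "http://proxy.example.com:8080",
        "https": "http://proxy.example.com:8080",
    },
    "headers": {"User-Agent": "Mozilla/5.0 (placeholder user agent)"},
    "search_queries": [
        # f_WT as an integer is an assumption; check the example config.
        {"keywords": "software QA", "location": "Canada", "f_WT": 2},
    ],
    "desc_words": ["clinical"],
    "title_only": False,
    "title_words": ["clinical"],
    "timespan": "r604800",  # past week
    "jobs_tablename": "jobs",
    "filtered_jobs_tablename": "filtered_jobs",
    "db_path": "jobs.db",
    "pages_to_scrape": 2,
}


def write_config(path, config):
    """Serialize the configuration dictionary to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=4)
```

Calling `write_config("config.json", example_config)` from the project root would produce a starting file that you can then edit by hand.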
The application consists of two main components: the scraper and the web interface.
The scraper is implemented in `main.py`. It scrapes job postings from LinkedIn based on the search queries and filters specified in the `config.json` file. The scraper removes duplicate and irrelevant job postings based on the specified keywords and stores the remaining job postings in a SQLite database.
To run the scraper, execute the following command:
python main.py
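The deduplication and keyword filtering described above could be sketched as follows. This is not the code from main.py: it assumes each posting is a dictionary with hypothetical `title`, `company`, and `description` keys, treats two postings with the same title and company as duplicates, and matches keywords as case-insensitive substrings.

```python
def remove_duplicates(jobs):
    """Keep the first posting for each (title, company) pair."""
    seen = set()
    unique = []
    for job in jobs:
        # Normalize case and whitespace so trivial variations still match.
        key = (job["title"].strip().lower(), job["company"].strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(job)
    return unique


def remove_by_description(jobs, desc_words):
    """Drop postings whose description contains any of the given words."""
    lowered = [word.lower() for word in desc_words]
    return [
        job
        for job in jobs
        if not any(word in job["description"].lower() for word in lowered)
    ]
```

Running `remove_duplicates` first keeps the description check from doing repeated work on postings that would be dropped anyway.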
The web interface is implemented using Flask in `app.py`. It provides a simple interface to view the job postings stored in the SQLite database. Users can mark job postings as applied or hidden, and the changes will be saved in the database.
When a job is marked as "applied", it is highlighted in light blue so that it's obvious at a glance which jobs you have applied to. Clicking "Hide" makes the job disappear from the list. There is currently no functionality to unhide or un-apply a job. To reverse either action, you have to open the database and change the values in the applied and hidden columns.
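Until the interface supports it, reversing a flag means updating the row yourself. Below is a minimal sketch using Python's built-in `sqlite3` module. The applied and hidden column names come from the description above; the `id` column, the use of 0 as the cleared value, and the `reset_job_flags` helper itself are assumptions, so check them against your actual table before running anything.

```python
import sqlite3


def reset_job_flags(db_path, table, job_id):
    """Set applied and hidden back to 0 for one job; return rows changed."""
    # Table names cannot be bound as SQL parameters, so validate the name
    # before interpolating it into the statement.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            cursor = conn.execute(
                # The "id" column is an assumed primary key.
                f"UPDATE {table} SET applied = 0, hidden = 0 WHERE id = ?",
                (job_id,),
            )
            return cursor.rowcount
    finally:
        conn.close()
```

Pass the values of `db_path` and `jobs_tablename` from your config.json as the first two arguments. A return value of 0 means no row matched the given id.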
To run the web interface, execute the following command:
python app.py
Then, open a web browser and navigate to `http://localhost:5000` to view the job postings.
The `config.json` file contains the configuration options for the scraper and the web interface. Below is a description of each option:
- `proxies`: The proxy settings for the requests library. Set the `http` and `https` keys with the appropriate proxy URLs.
- `headers`: The headers to be sent with the requests. Set the `User-Agent` key with a valid user agent string. If you don't know your user agent, google "my user agent" and it will show it.
- `search_queries`: An array of search query objects, each containing the following keys:
  - `keywords`: The keywords to search for in the job title.
  - `location`: The location to search for jobs.
  - `f_WT`: The job type filter. Values are as follows:
    - 0 - onsite
    - 1 - hybrid
    - 2 - remote
    - empty (no value) - any one of the above.
- `desc_words`: An array of keywords to filter out job postings based on their description.
- `title_only`: A boolean (true/false) value that controls how job filtering is done:
  - true: ONLY jobs that have at least one of the words from `title_words` in their title will be considered; the rest will be discarded.
  - false: jobs that have ANY of the words from `title_words` in their title will be discarded; the rest will be scraped.
- `title_words`: An array of keywords used to filter job postings by title, according to the `title_only` value.
- `timespan`: The time range for the job postings: "r604800" for the past week, "r86400" for the last 24 hours. Basically "r" plus the number of seconds, that is 60 * 60 * 24 * (number of days).
- `jobs_tablename`: The name of the table in the SQLite database where the job postings will be stored.
- `filtered_jobs_tablename`: The name of the table in the SQLite database where the filtered job postings will be stored.
- `db_path`: The path to the SQLite database file.
- `pages_to_scrape`: The number of pages to scrape for each search query.
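Two of the options above are easy to get wrong, so here is a small sketch of each. `timespan_param` builds the value as "r" plus the number of seconds in the given number of days, and `keep_by_title` applies the `title_only` rule to a single job title. Both helper names are hypothetical and not taken from the project, and case-insensitive substring matching is an assumption about how words are compared.

```python
def timespan_param(days):
    """Build the timespan value: "r" followed by the number of seconds."""
    if days <= 0:
        raise ValueError("days must be a positive integer")
    return f"r{60 * 60 * 24 * days}"


def keep_by_title(title, title_words, title_only):
    """Return True if a job with this title passes the title filter."""
    lowered = title.lower()
    has_match = any(word.lower() in lowered for word in title_words)
    # title_only=True keeps only matching titles;
    # title_only=False discards matching titles and keeps the rest.
    return has_match if title_only else not has_match
```

For example, `timespan_param(7)` gives `"r604800"` and `timespan_param(1)` gives `"r86400"`. With `title_words` set to `["clinical"]` and `title_only` set to false, "Clinical QA Manager" is discarded while "Software QA Engineer" is kept.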
- Add functionality to unhide and un-apply jobs.
- Add functionality to sort jobs by date added to the database. Current sorting is by date posted on LinkedIn. Some jobs (~1-5%) are not picked up by the search (and therefore by this scraper) until days after they are posted. This is a known issue with LinkedIn and there's nothing I can do about it; however, sorting jobs by date added to the database will make those jobs easier to find.
- Add front-end functionality to configure the search and execute it from the UI. Currently, configuration is done in the JSON file and the search is executed from the command line.
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License.