- LinkedIn is one of the most popular platforms for candidate searching.
- Market demand for Data Scientist expertise is increasing, and demand exceeds supply.
- An overview of potential candidates and speed in identifying the right one are therefore crucial in this talent war.
- This led to the idea of scraping LinkedIn profiles, focusing on Data Scientists.
- Web scraping on LinkedIn with Selenium, since one must log in to view candidates' profiles.
- Focus on Data Scientists.
- Identify the URL of the candidate list.
- Scrape the candidate URLs from the candidate list.
- Scrape each candidate profile for information such as name, location, experience, education and skills.
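
The "collect candidate URLs from the list page" step could look roughly like the sketch below. It is only an illustration, not the project's actual script: the function names, the `a[href*='/in/']` selector and the assumption that profile paths start with `/in/` are all placeholders to be checked against the live page. `driver` is assumed to be an already logged-in Selenium WebDriver.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_profile_url(href):
    """Drop query string and fragment so the same profile is not scraped twice."""
    parts = urlsplit(href)
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))


def unique_profile_urls(hrefs):
    """Keep links whose path starts with '/in/' (assumed profile pages), in order, without duplicates."""
    seen, result = set(), []
    for href in hrefs:
        url = normalize_profile_url(href)
        if urlsplit(url).path.startswith("/in/") and url not in seen:
            seen.add(url)
            result.append(url)
    return result


def collect_candidate_urls(driver, list_url):
    """Open the candidate list page and return the profile URLs found on it."""
    # Imported lazily so the pure helpers above work without Selenium installed.
    from selenium.webdriver.common.by import By

    driver.get(list_url)
    # Placeholder selector (assumption): inspect the page to find the real one.
    links = driver.find_elements(By.CSS_SELECTOR, "a[href*='/in/']")
    return unique_profile_urls(link.get_attribute("href") for link in links)
```

Keeping the URL clean-up separate from the browser call means it can be checked without logging in to LinkedIn.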
- Candidates by Location - To understand where candidates are located and define the hiring strategy accordingly. A pie chart gives a good overview of the locations with the highest density of Data Scientists.
- Candidates Skill Sets - To gain insight into the candidates' skills, which are crucial to the job roles.
- Total Skills Histogram - To understand how many skills candidates list on their profiles.
- Candidates Education Word Cloud - To get a high-level understanding of the distribution of candidates' education backgrounds.
- Table - Shows the raw scraped data.
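
As a rough sketch of the aggregation behind these charts, the counts can be derived with `collections.Counter` before handing them to a plotting library. The records and field names below are made up for illustration and are not taken from the scraped data.

```python
from collections import Counter

# Hypothetical records, shaped like the scraped fields described above.
candidates = [
    {"name": "A", "location": "Singapore", "skills": ["Python", "SQL", "Machine Learning"]},
    {"name": "B", "location": "Kuala Lumpur", "skills": ["Python", "R"]},
    {"name": "C", "location": "Singapore", "skills": ["Python"]},
]


def location_counts(records):
    """Candidates per location: the input for the pie chart."""
    return Counter(r["location"] for r in records)


def skill_counts(records):
    """How often each skill appears across all candidates."""
    return Counter(skill for r in records for skill in r["skills"])


def skills_per_candidate(records):
    """Number of skills listed by each candidate: the input for the histogram."""
    return [len(r["skills"]) for r in records]
```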
- Hosted on Digital Ocean, here.
- Login: username: test, password: abababab
- LinkedIn has filed a lawsuit over a company scraping its site.
- During the scraping period, two of my LinkedIn accounts were flagged for unusual activity. I then had to scrape at a low frequency, around 10+ candidates or around 20 web pages per hour.
- XPaths on the LinkedIn site are dynamic, so CSS selectors were chosen instead.
- The PyPI package Linked Scraper doesn't work, so the script had to be written from scratch.
- Use Plotly to plot the graphs, as it is relatively easy to render in HTML.
- Use a JS word cloud library instead of a Python one.
- Great to have found the Atlantis template, which features user registration and login.
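
The low scraping frequency mentioned above (around 20 pages per hour) could be enforced with a small throttle like the one below. This is a minimal sketch under assumed defaults, not the code used in the project: the `Throttle` class, its parameters and the jitter value are hypothetical.

```python
import random
import time


class Throttle:
    """Enforce a minimum gap between page loads, with optional random jitter."""

    def __init__(self, pages_per_hour=20, jitter=0.2, clock=time.monotonic, sleep=time.sleep):
        self.min_gap = 3600.0 / pages_per_hour  # seconds between two page loads
        self.jitter = jitter                    # extra delay, as a fraction of min_gap
        self._clock = clock
        self._sleep = sleep
        self._last = None                       # time of the previous page load

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            gap = self.min_gap * (1 + random.uniform(0, self.jitter))
            remaining = gap - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

Calling `throttle.wait()` before every page load keeps the rate at or below the configured limit. The clock and sleep functions are injectable so the behaviour can be checked without actually waiting.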
- Scrape data
- Clean it, document it, visualize it
- Project runnable locally
- Push to GitHub with proper commit messages
- Host it on Digital Ocean
- To meet the original objective, i.e. hunting for data scientists on LinkedIn, it would be better to display the results in a tabular format with a search feature.
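
One possible shape for such a search feature is a simple case-insensitive filter over the scraped rows, sketched below. The record fields (`name`, `location`, `skills`) and the function name are assumptions for illustration only.

```python
def search_candidates(records, query):
    """Return the records in which every query term appears in some field (case-insensitive)."""
    terms = query.lower().split()
    matches = []
    for record in records:
        # Flatten the searchable fields of one row into a single lowercase string.
        haystack = " ".join(
            [record["name"], record["location"], *record["skills"]]
        ).lower()
        if all(term in haystack for term in terms):
            matches.append(record)
    return matches


# Hypothetical rows, shaped like the scraped table.
rows = [
    {"name": "A", "location": "Singapore", "skills": ["Python", "SQL"]},
    {"name": "B", "location": "Kuala Lumpur", "skills": ["Python", "R"]},
]
```

For example, `search_candidates(rows, "python singapore")` keeps only the rows that mention both terms.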