yong2khoo-lm/Mini-Project-1

Description

  1. LinkedIn is one of the most popular platforms for candidate searching.
  2. Market demand for Data Scientist expertise is increasing, and demand is greater than supply.
  3. Thus, a quick overview of potential candidates and speed in identifying the right candidate are crucial in this talent war.
  4. With this in mind, the idea is to scrape LinkedIn profiles, focusing on Data Scientists.

Approach

  1. Web scraping on LinkedIn with Selenium, as login is required to view candidates' profiles.
  2. Focus on Data Scientists.
  3. Identify the URL that returns the candidate list.
  4. Scrape the candidate URLs from the candidate list.
  5. Scrape each candidate profile to get information such as name, location, experience, education and skills.
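
Steps 3 to 5 could be sketched roughly as follows. This is a minimal sketch, not the repository's actual script: the function names, the `a[href*='/in/']` selector, and the assumption that profile links contain `/in/` are all illustrative, and the login step is omitted.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_profile_url(href):
    """Strip query string, fragment and trailing slash so a profile is not scraped twice."""
    parts = urlsplit(href)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def unique_profile_urls(hrefs):
    """Keep links that look like profile pages (assumed to contain '/in/'), in first-seen order."""
    seen = []
    for href in hrefs:
        if not href or "/in/" not in href:
            continue
        url = normalize_profile_url(href)
        if url not in seen:
            seen.append(url)
    return seen


def collect_candidate_urls(driver, list_url):
    """Open a candidate list page with an already logged-in Selenium driver
    and return the profile URLs found on it."""
    from selenium.webdriver.common.by import By  # imported lazily; requires selenium

    driver.get(list_url)
    # Hypothetical selector: any anchor whose href contains '/in/'
    anchors = driver.find_elements(By.CSS_SELECTOR, "a[href*='/in/']")
    return unique_profile_urls(a.get_attribute("href") for a in anchors)
```

Keeping the URL clean-up separate from the browser call means the de-duplication logic can be checked without opening a browser.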

Analysis and Visualization

  1. Candidates by Location - To understand where candidates are located and thus define a hiring strategy. A pie chart gives a good overview of the locations with the most Data Scientists.

  2. Candidates Skill Sets - To gain insight into candidates' skills, which are crucial for the job roles.

  3. Total Skills Histogram - To understand how many skills candidates list on their profiles.

  4. Candidates Education Word Cloud - To get a high-level understanding of the distribution of candidates' education backgrounds.

  5. Table - Shows the raw scraped data.
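
The location, skill set and histogram views above reduce to simple counts over the scraped records. A minimal sketch, assuming each candidate is a dict with hypothetical `location` and `skills` keys (the field names used in the repository may differ):

```python
from collections import Counter


def location_counts(candidates):
    """Number of candidates per location (input for the pie chart)."""
    return Counter(c["location"] for c in candidates if c.get("location"))


def skill_counts(candidates):
    """How many candidates list each skill."""
    counts = Counter()
    for c in candidates:
        counts.update(set(c.get("skills", [])))  # set(): count a skill once per candidate
    return counts


def skills_per_candidate(candidates):
    """Number of skills on each profile (input for the histogram)."""
    return [len(c.get("skills", [])) for c in candidates]


# Made-up sample records, for illustration only
sample = [
    {"location": "Singapore", "skills": ["Python", "SQL"]},
    {"location": "Singapore", "skills": ["Python"]},
    {"location": "Kuala Lumpur", "skills": []},
]
```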

Hosting

  1. Hosted on Digital Ocean, here.
  2. Login: username: test, password: abababab

Challenges

Web scraping

  1. LinkedIn has filed a lawsuit against a company for scraping its site.
  2. During the scraping period, two of my LinkedIn accounts were flagged for unusual activity. I therefore had to scrape at a low frequency, around 10+ candidates or around 20 web pages per hour.
  3. XPaths are dynamic on the LinkedIn site, so CSS selectors were used instead.
  4. The PyPI package Linked Scraper doesn't work, so the script had to be written from scratch.
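
The low scraping frequency in item 2 can be enforced with an evenly spaced delay plus some random jitter between page loads. A minimal sketch, assuming the budget of about 20 pages per hour mentioned above; the function names and the jitter range are illustrative assumptions, not the repository's code:

```python
import random
import time

PAGES_PER_HOUR = 20  # budget taken from the text above


def next_delay(pages_per_hour=PAGES_PER_HOUR, jitter=0.25, rng=random):
    """Seconds to wait before the next page: the even spacing for the budget,
    varied by up to +/- `jitter` (a fraction) so requests are not perfectly periodic."""
    base = 3600 / pages_per_hour
    return base * (1 + rng.uniform(-jitter, jitter))


def throttled(urls, fetch, sleep=time.sleep):
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(next_delay())
        results.append(fetch(url))
    return results
```

With the defaults, each wait falls between 135 and 225 seconds, which averages out to the 20 pages per hour budget.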

Django

  1. Use Plotly to plot graphs, as it is relatively easy to render in HTML.
  2. Use a JavaScript word cloud library instead of a Python one.
  3. The Atlantis template, which features user registration and login, was a great find.
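
One common way to render a Plotly chart in a Django page is to convert the figure to an HTML `<div>` string and pass it to the template. A minimal sketch under that assumption; `pie_inputs` and `pie_chart_div` are hypothetical helper names, and the repository may do this differently:

```python
def pie_inputs(counts):
    """Turn a label -> value mapping into parallel lists, largest slice first."""
    items = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    labels = [label for label, _ in items]
    values = [value for _, value in items]
    return labels, values


def pie_chart_div(counts):
    """Return an HTML <div> string containing a pie chart of `counts`."""
    import plotly.graph_objects as go  # imported lazily; requires the plotly package

    labels, values = pie_inputs(counts)
    fig = go.Figure(go.Pie(labels=labels, values=values))
    # full_html=False yields only the <div>; plotly.js is loaded from a CDN
    return fig.to_html(full_html=False, include_plotlyjs="cdn")
```

In a view, the returned string would be placed in the template context and output with Django's `safe` filter so the markup is not escaped.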

Project Requirements

  • Scrape data
  • Clean it, document it, visualize it
  • Project runnable locally
  • Push to GitHub with proper commit messages
  • Host it on Digital Ocean

Enhancement

  1. To meet the original objective, namely hunting for Data Scientists on LinkedIn, it would be better to display the results in a tabular format with a search feature.