A Python spider that uses Selenium to crawl Facebook user profile information such as first name, last name, work, and education, and writes the results to a CSV file.
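As an illustration of the output step, writing such profile rows to a CSV file with the standard `csv` module might look like this (the field names and data are hypothetical, not the project's actual schema):

```python
import csv

# Hypothetical profile rows, shaped like what the spider might collect
profiles = [
    {'first_name': 'Ada', 'last_name': 'Lovelace',
     'work': 'Analytical Engine', 'education': 'Home tutoring'},
]

fieldnames = ['first_name', 'last_name', 'work', 'education']
with open('profiles.csv', 'w') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()          # header row: first_name,last_name,...
    writer.writerows(profiles)    # one CSV row per profile dict
```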
Because Facebook's pages are rendered by many JavaScript plugins, we cannot simply crawl the data with regular expressions or the Scrapy framework. Instead, we use Selenium to simulate a real browser and extract data from the rendered page. Selenium is slower, but it is the most effective way to crawl heavily scripted sites such as Facebook or Taobao.
This project is best run from Eclipse on Windows 7; support for Ubuntu, so it can run from the Linux terminal, will be added later.
- Python 2.7
- Selenium 2.42.1
- BeautifulSoup 4.3.2
- urllib2
- A stable VPN account if you are in mainland China.
- JDK 1.6+
- Eclipse
First, make sure you can access Facebook freely and quickly, then run facebookSpider.py. It will log in to Facebook automatically and crawl data from the specified URLs one by one.
All the URLs are listed in the urls.py file, and all the configuration items are in the settings.py file.
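For illustration only, the contents of those two files might look something like this (the names `URLS` and `SETTINGS` and all the values are hypothetical, not copied from the project):

```python
# urls.py -- hypothetical example: the profile URLs to crawl
URLS = [
    'https://www.facebook.com/some.profile.1',
    'https://www.facebook.com/some.profile.2',
]

# settings.py -- hypothetical example: runtime configuration items
SETTINGS = {
    'output_file': 'result.csv',   # where crawled rows are written
    'page_load_timeout': 30,       # seconds to wait for a page to render
}
```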
Some users have reported problems running this application; the usual cause is that the User-Agent was not set correctly. In facebookLogin.py, change the User-Agent value to match the browser you are using. Selenium can only run normally once it is set correctly.
```python
# In facebookLogin.py (module-level imports shown for completeness)
import cookielib
import urllib2

def __init__(self):
    '''
    Constructor
    '''
    cookie = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
    # Change the User-Agent to match your own browser; do not copy this value directly!
    opener.addheaders = [('Referer', 'http://login.facebook.com/login.php'),
                         ('Content-Type', 'application/x-www-form-urlencoded'),
                         ('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 (.NET CLR 3.5.30729)')]
    self.opener = opener
```
The User-Agent header, together with the cookie jar held by the opener, avoids having to log in to Facebook again every time data is fetched. If you do not know how to find your browser's User-Agent value, just Google it!
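As a sketch, the login session could also be kept between runs by saving the cookie jar to disk. The file name and the User-Agent placeholder below are illustrative, and the try/except import shim is only there so the snippet also runs on Python 3; the project itself uses the Python 2 names directly:

```python
# Sketch: persist login cookies so the spider does not re-login on every run.
try:
    import cookielib                    # Python 2, as used by the project
    import urllib2 as urlrequest
except ImportError:
    import http.cookiejar as cookielib  # Python 3 equivalents
    import urllib.request as urlrequest

COOKIE_FILE = 'facebook_cookies.txt'    # hypothetical file name

jar = cookielib.MozillaCookieJar(COOKIE_FILE)
opener = urlrequest.build_opener(urlrequest.HTTPCookieProcessor(jar))
# Replace the placeholder with your own browser's User-Agent string.
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (your browser User-Agent here)')]

# After a successful login:
#     jar.save(ignore_discard=True, ignore_expires=True)
# On the next run, before fetching:
#     jar.load(ignore_discard=True, ignore_expires=True)
```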