GHfangxin/spiderq

 
 

What is spiderq?

Spiderq is a web spider, written by Qteqpid, that crawls web pages (HTML). Its performance depends on your server configuration and network. I will continue to maintain it, and I list some TODOs at the end of this file. Contributors are welcome!

Building spiderq

Spiderq can be compiled and used on CentOS 5.8. It is as simple as:

% make
% make install

Then you will get an executable file named spider. After configuring spiderq.conf, run the program:

% ./spider

For more information, see the Makefile.

Contact

If you have any questions, feel free to contact me at any time. Enjoy!

mailto: qteqpid[email protected]
blog: http://hi.baidu.com/qteqpid_pku

TODO

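@thread pool @signal handling @deduplication of page content @minimum interval between fetches from the same IP @storing pages in a hierarchical structure @option to obey robots.txt or not @update crawling that does not re-fetch pages already crawled @a public API and an HTML class, loaded dynamically, so users can customize how HTML is processed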
