Google, Naver multiprocess image crawler (High Quality & Speed & Customizable)
-
Install Chrome
-
pip install -r requirements.txt
-
Write search keywords in keywords.txt
-
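For example, keywords.txt holds one search term per line (the terms below are just placeholders):

```
cat
golden retriever
seoul tower
```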
Run "main.py"
-
Files will be downloaded to 'download' directory.
usage:
python3 main.py [--skip true] [--threads 4] [--google true] [--naver true] [--full false] [--face false] [--no_gui auto] [--limit 0]
--skip true Skips a keyword if its download directory already exists. Useful when re-running after an interrupted crawl.
--threads 4 Number of download threads.
--google true Download from google.com (boolean)
--naver true Download from naver.com (boolean)
--full false Download full-resolution images instead of thumbnails (slow)
--face false Face search mode
--no_gui auto Headless (no GUI) mode. Speeds up full-resolution mode but is unstable in thumbnail mode.
Default: "auto" - false if full=false, true if full=true
(useful for Docker/Linux systems without a display)
--limit 0 Maximum number of images to download per site (0 = unlimited).
--proxy-list '' A comma-separated proxy list, e.g. "socks://127.0.0.1:1080,http://127.0.0.1:1081".
Each thread randomly chooses one proxy from the list.
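The per-thread proxy selection described above can be sketched as follows (a minimal illustration, not the project's actual code; the function name is hypothetical):

```python
import random

def choose_proxy(proxy_list: str):
    """Pick one proxy at random from a comma-separated list.

    Returns None when the list is empty, mirroring the default
    --proxy-list '' (no proxy).
    """
    proxies = [p.strip() for p in proxy_list.split(',') if p.strip()]
    if not proxies:
        return None
    return random.choice(proxies)

# Each worker thread would call this once at startup:
proxy = choose_proxy("socks://127.0.0.1:1080,http://127.0.0.1:1081")
```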
You can download full-resolution JPG, GIF, and PNG images by specifying --full true
The crawler detects data imbalance based on file counts.
When crawling finishes, a message lists the directories containing fewer than 50% of the average number of files.
It is recommended to remove those directories and re-download them.
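The imbalance check can be sketched like this (an illustrative reimplementation, assuming the per-keyword file counts have already been gathered):

```python
def find_underfilled(counts, threshold=0.5):
    """Return directories whose file count is below threshold * average.

    `counts` maps a keyword directory name to its number of files.
    """
    if not counts:
        return []
    average = sum(counts.values()) / len(counts)
    return [name for name, n in counts.items() if n < average * threshold]

# 'cat' holds far fewer files than the average of the three directories.
print(find_underfilled({'cat': 10, 'dog': 100, 'bird': 90}))  # ['cat']
```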
sudo apt-get install xvfb <- virtual display server
sudo apt-get install screen <- lets you detach from the SSH session while the crawler keeps running
screen -S s1
Xvfb :99 -ac & DISPLAY=:99 python3 main.py
You can build your own crawler by modifying collect_links.py
Since Google's site changes frequently, please open an issue if the crawler stops working.
Google's XPath selectors can change from time to time, so check them before running. For instance, this line in collect_links.py:
photo_grid_boxes = self.browser.find_elements(By.XPATH, '//div[@class="isv-r PNCib ViTmJb BUooTd"]')
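Because that class string is fragile, one way to make a custom crawler more robust is to try several selectors in order. A minimal sketch (the fallback XPath and function name are hypothetical; "xpath" is the literal string value of Selenium's By.XPATH constant, used here so the sketch has no hard selenium dependency):

```python
# Candidate selectors, tried in order. The first entry is the class string
# shown above; the second is a hypothetical, looser fallback.
GRID_XPATHS = [
    '//div[@class="isv-r PNCib ViTmJb BUooTd"]',
    '//div[@role="listitem"]',
]

def find_photo_grid_boxes(browser, xpaths=GRID_XPATHS):
    """Return the first non-empty result among the candidate XPaths."""
    for xpath in xpaths:
        # "xpath" is the string behind selenium's By.XPATH
        boxes = browser.find_elements("xpath", xpath)
        if boxes:
            return boxes
    return []
```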