Build image datasets from the command line.
- Allows the user to build multiclass image datasets by taking a
class name
and aGoogle search term
that scrapes images on Google for that class. - Allows the user to select the minimum and maximum number of scraped images to be considered for a class. If more images are scraped than the maximum limit, the extra images are removed. If lesser images are scraped than the minimum limit, the class is discarded. Otherwise, whatever images are found will be split into train and test.
- To obtain a fixed number of images for each class (i.e. 100 each, 200 each), just ensure that the
minimum
andmaximum
limits have the same value (as the number of images desired.) - Allows the user to enter a split ratio (i.e. 0.75 for 75% train, 25% test or 0.8 for 80% train, 20% test) according to which all scraped, valid images are split and arranged into folders.
The scraped images are downloaded and arranged in a structure similar to this. They can be suitably paired with Tensorflow's ImageDataGenerator to load datasets from the directory on the fly.
Depending on the size of classes specified, the script takes a couple minutes to run, download and arrange images. When the process is finished, you can find the results in the data
folder with train
and test
folders in it, each containing subfolders for each class with images in it.
Note: currently, it can extract around 300 to 400 images for a class, or even go above 500 for common classes like fruits, animals etc.
You must have Python 3.6 or higher to run the file.
- Create a new virtual environment for running the application. You can follow the instructions here.
- Navigate to the virtual environment and activate it.
- Install the dependancies using
pip install -r requirements.txt
- Replace the folder for
chromedriver_win32
with a relevant version which matches your browser. - Run the
scraper.py
file withpython scraper.py
- Expanding the functionality to other image search engines like Yahoo, Bing for a larger dataset.
- Add an option for a validation set to be built using a validation split.