okadabooks_scraper

This scraper extracts details from all available bookcards (a summary card for each book) on the website https://okadabooks.com. Okadabooks is a popular online store for African literature based in Nigeria. Since the company was founded in 2013, it has helped grow the reading culture in the country and pioneered many writing initiatives. For more info, check the website. Here is what one of the bookcards looks like; highlighted are some of the details that will be scraped in this project.

The full list of fields to be scraped is below (a parsing sketch follows the list):

1) Title

2) Author

3) Genre

4) Price

5) Reads

6) Ratings

7) Blurb (the description)

8) Booklink
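
As a rough, hedged sketch of this target schema (not the project's actual notebook code), each bookcard could be parsed into the eight fields roughly as follows. All selectors here (`.book-card`, `.title`, and so on) are hypothetical placeholders; the real class names on okadabooks.com have to be inspected in the browser.

```python
from bs4 import BeautifulSoup

def parse_bookcard(card):
    """Turn one bookcard element into the eight target fields.
    The CSS selectors are illustrative placeholders, not the site's real markup."""
    def text(selector):
        node = card.select_one(selector)
        return node.get_text(strip=True) if node else None

    link = card.select_one("a")
    return {
        "Title": text(".title"),
        "Author": text(".author"),
        "Genre": text(".genre"),
        "Price": text(".price"),
        "Reads": text(".reads"),
        "Ratings": text(".rating"),
        "Blurb": text(".blurb"),
        "Booklink": link["href"] if link else None,
    }

# Usage, once the page has been fully expanded (see the Selenium loop below):
# soup = BeautifulSoup(html, "html.parser")
# books = [parse_bookcard(card) for card in soup.select(".book-card")]
```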

Dynamic content

When any kind of interaction with the webpage is required during web scraping, the page is no longer static but dynamic, and this is where Selenium comes to the rescue: it automates that interaction. There are 22 categories on the website, and each category has a Load More button that reveals more content when a user clicks it. To get all bookcards from the website, the Load More button must be clicked until it is no longer available. The crawling algorithm must imitate user interaction, both to avoid being blocked and to make sure too many requests are not sent to the website at once. The extracted dataset will be stored in this repository, and the EDA done on it will be posted here as well.
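
A minimal sketch of that click-until-gone loop, assuming Selenium with ChromeDriver, is shown below. The category URL and the button locator (`By.CLASS_NAME, "load-more"`) are assumptions for illustration, not the site's actual markup.

```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException

driver = webdriver.Chrome()
driver.get("https://okadabooks.com/store/some-category")  # hypothetical category URL

while True:
    try:
        # Placeholder locator; inspect the page for the real one.
        button = driver.find_element(By.CLASS_NAME, "load-more")
    except NoSuchElementException:
        break  # button gone: all bookcards for this category are loaded
    try:
        button.click()
    except ElementClickInterceptedException:
        pass  # something overlapped the button; try again on the next pass
    time.sleep(3)  # buffer to imitate a user and avoid flooding the site
```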

Load More button

Here are pictures showing the webpages at the beginning and at the end of scraping. The contents of each book are then extracted once the Load More button is no longer there.

Before scraping:


There are different types of pagination, including infinite scrolling (common on Twitter), serial pagination, and more. In this project, Load More pagination is used, where a click triggers more content. Because that content is injected with JavaScript, the BeautifulSoup library alone won't work. Selenium's ActionChains click helps scroll down to the end of the webpage after the WebElement click has been used. In the simplest terms, the idea is for automated software (in this case ChromeDriver) to click the Load More button continuously, with a pause of a few seconds between clicks (a buffer to imitate a user and to avoid overloading the website with requests), until the button no longer exists.
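
A hedged sketch of that scroll-and-click pattern, ending with the handoff of the rendered HTML to BeautifulSoup, might look like this (again, the URL and locators are placeholders, not the site's real markup):

```python
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("https://okadabooks.com/store/some-category")  # hypothetical URL

while True:
    # Scroll to the bottom so the newly revealed bookcards (and the
    # Load More button) are in view before the next click.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        button = driver.find_element(By.CLASS_NAME, "load-more")  # placeholder locator
    except NoSuchElementException:
        break
    ActionChains(driver).move_to_element(button).click(button).perform()
    time.sleep(3)  # pause between clicks

# Only now is the fully expanded page parsed; BeautifulSoup alone could not
# have reached this state, because the extra cards are injected by JavaScript.
soup = BeautifulSoup(driver.page_source, "html.parser")
cards = soup.select(".book-card")  # placeholder selector
```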

After scraping:


Snapshot of dataframe

After scraping, cleaning, and removing all duplicates among the bookcards collected, around 18,000 books were left. This count excludes books that had been taken down by the admin or author after publication. A similarity metric will be calculated over the millions of words in the blurb column, and some EDA will be done to understand the reading patterns of Nigerians and, at the same time, gauge the bookstore's influence on the Nigerian reading ecosystem.
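
As a hedged sketch of that planned step (not the notebook actually used here), deduplication plus a simple TF-IDF cosine similarity over the blurbs could look like the following; the file name `okadabooks.csv` and the column names are assumptions based on the field list above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder file name; the repository's actual dataset file may differ.
df = pd.read_csv("okadabooks.csv")

# Drop exact duplicate bookcards (same title and author).
df = df.drop_duplicates(subset=["Title", "Author"]).reset_index(drop=True)

# TF-IDF over the blurbs, then pairwise cosine similarity between books.
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(df["Blurb"].fillna(""))
similarity = cosine_similarity(matrix[:100], matrix[:100])  # small slice to keep it cheap

print(similarity.shape)
```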

Here is what the final table looks like:

For more analysis of the data collected (for example, an NLP project), please check my portfolio.

Happy coding!!!
