<p align="center">
<a href="https://dashboard.smartproxy.com/?page=residential-proxies&utm_source=socialorganic&utm_medium=social&utm_campaign=resi_trial_GITHUB"><img src="https://i.imgur.com/3uZgYJ9.png"></a>
</p>

<p align="center">
<a href="https://github.com/Smartproxy/Smartproxy"> :house: Main Repository :house: </a>
</p>

## Table of contents

- [Disclaimer](#disclaimer)
- [What is web scraping with Python?](#what-is-web-scraping-with-python)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Be polite](#be-polite)
- [How to build a web scraper?](#how-to-build-a-web-scraper)
  - [Inspecting the site](#inspecting-the-site)
  - [Requesting and parsing the data](#requesting-and-parsing-the-data)
  - [Extracting the data](#extracting-the-data)
- [Conclusion](#conclusion)
- [Contact](#contact)

## Disclaimer

The following tutorial is meant for educational purposes and introduces the basics of building a web scraping project with Smartproxy proxies. See the [Requests](https://requests.readthedocs.io/en/master/user/quickstart/) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) documentation to learn more about these libraries and to build upon the given example.

## What is web scraping with Python?

If you're here, you're probably interested in learning how to scrape valuable data from the web. Before diving in, let's first understand what web scraping is.

Scraping is an automated process of acquiring a web page with all its content and extracting selected information from it for further processing. The most common reason to scrape is to avoid the hassle of gathering data manually – an automated scraper can collect large amounts of data in seconds. Common examples include gathering reviews, prices, weather reports, billboard hits, and so on.

## Prerequisites

To run the example scraper, you're going to need [Python](https://www.python.org/downloads/) together with these libraries installed:

* [Requests](https://pypi.org/project/requests/)
* [BeautifulSoup](https://pypi.org/project/beautifulsoup4/)
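
If you don't have them yet, both can be installed with pip (the package names match the PyPI pages linked above):

```
pip install requests beautifulsoup4
```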


## Installation

To install the scraper example, run the following:

```
git clone https://github.com/Smartproxy/Python-scraper-tutorial.git
```

or

```
curl https://raw.githubusercontent.com/Smartproxy/Python-scraper-tutorial/master/scraper.py > scraper.py
```

## Be polite

Just as you are polite and respectful in the real world, you should be online as well. Before you start scraping, make sure the website you're targeting allows it. You can do that by checking its **robots.txt** file. If the site doesn't allow crawling or scraping of its content, be kind and respect the owner's wishes. Failing to do so might get your IP blocked, rate-limited, or even lead to legal action against you. Moreover, check whether the site you're targeting has an API. If it does, use it instead – getting the data will be more straightforward, and you won't put unnecessary load on the website's servers.
## How to build a web scraper?

In the following tutorial, you'll learn not only how to write a basic scraper but also how to modify the code to fit your own needs. You'll also learn how to set it up with proxies so that your scraping stays anonymous.

As mentioned before, you'll be using these libraries:
* [Requests](https://pypi.org/project/requests/)
* [BeautifulSoup](https://pypi.org/project/beautifulsoup4/)

The page we're going to scrape is http://books.toscrape.com/. It's a website built specifically for testing web scraping scripts. Before moving on to the code, let's inspect the website first.

### Inspecting the site

The following screenshot shows what the main page of the website looks like. You can see it contains books, their titles, prices, ratings, availability information, and a list of genres in the sidebar.

<p align="center">
<img src="https://i.imgur.com/ovjMkS6.png" alt="books.toscrape.com Main window">
</p>

When you select a specific book, you're taken to a page with even more information, such as its description, how many are in stock, the number of reviews, etc.

<p align="center">
<img src="https://i.imgur.com/YRy5a1r.png" alt="books.toscrape.com Article window">
</p>

The next step is to decide what information you'd like to extract from the site. Generally, you'll want information that's valuable for later use. There are many use cases: extracting the title and price to compare them with another website, getting direct links to books, checking whether they're available, or grabbing the descriptions to build an easy-to-read list in a spreadsheet.

Once you know exactly what you want from the site, you can inspect those elements to see how to target them. Look at the site's HTML structure and inspect the elements you need. If you're using Google Chrome, right-click anywhere on the site with your mouse and select **Inspect**. Other web browsers will have an equivalent option to do this, too.

The Chrome DevTools will open and display the HTML structure of the page. You can manually search for the item you need or use the element picker tool in the top-left corner. Select it, then hover over the item you need on the page, and it'll be highlighted in the HTML code. After a quick inspection, you can see that the main information on each book is located in an article element with the class name **product_pod**.

<p align="center">
<img src="https://i.imgur.com/QbdDzyW.png" alt="books.toscrape.com Inspecting the HTML">
</p>

All of the data you'll need is nested in that **article** element. Now, let's inspect the price. You can see that the price value is the text of the paragraph with the **price_color** class. If you inspect the In stock label, you can see that it's the text value of the paragraph with the **instock availability** class. Check out the other elements on the page and see how they're represented in the HTML. Once you're done, let's build a simple web scraper to extract this data through code.

### Requesting and parsing the data

The goal is to write a script that gets the title, price, availability, description, and link of each book and prints them out in a nice, readable format.

Begin writing the script by creating a Python file (.py) in your desired directory and opening it in a code editor. The first few lines should import the libraries you'll be using:

```python
import requests
from bs4 import BeautifulSoup
```

You'll need the **Requests** library to send HTTP requests and **BeautifulSoup** to parse the responses you receive from the website.

Then, you'll need to write a GET request to retrieve the contents of the site. Assign the response to the variable `r`.

The `requests.get` function has only one required argument: the URL of the site you're targeting. However, because you'll want to use a proxy to reach the content, you must also pass in a `proxies` parameter. Declare these variables above your `requests.get` statement.

```python
proxy = {'http': 'http://username:password@gate.smartproxy.com:10000'}
url = 'http://books.toscrape.com/'
r = requests.get(url, proxies=proxy)
```

For the proxy, you first need to specify its kind – in this case, HTTP. Then, enter your Smartproxy username and password, separated by a colon, followed by the endpoint you'll be using to connect to the proxy server. In this example, we're using residential proxies. You can get this information from the dashboard by following these steps:
1. Open the proxy setup tab.
2. Navigate to the Endpoint generator.
3. Configure the parameters according to your needs. Set your authentication method, location, session type, and protocol.
4. Select the number of proxy endpoints you want to generate (you'll only need 1 for now).
5. Copy the endpoint(s).

<p align="center">
<a href="https://smartproxy.com/"><img src="https://i.imgur.com/M2J00E4.png"></a>
</p>

The `url` parameter is simply the address of the site you want to scrape.
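
Before moving on, you can optionally check that the proxy connection works. This sketch assumes an IP-echo service (httpbin.org is used here purely as an illustration – it isn't part of the tutorial):

```python
import requests

proxy = {'http': 'http://username:password@gate.smartproxy.com:10000'}

# If the proxy works, this prints the proxy server's IP instead of your own
check = requests.get('http://httpbin.org/ip', proxies=proxy)
print(check.text)
```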

Currently, the variable `r` holds the entire response data from the website, including the status code, headers, URL itself, and, most importantly, the content you need. You can add the following line to print the content:
```python
print(r.content)
```

The code so far should look like this:
```python
import requests
from bs4 import BeautifulSoup

proxy = {'http': 'http://username:password@gate.smartproxy.com:10000'}
url = 'http://books.toscrape.com/'
r = requests.get(url, proxies=proxy)
print(r.content)
```

Run the code in your Terminal with:

```
python file_name.py
```

You'll see that it's the site's HTML code you inspected previously. Unfortunately, this data comes in a raw format that's hard to read and understand. The script needs to be improved to parse out only the relevant data.

A simple first step in cleaning up the data is to parse the HTML with BeautifulSoup. Create a variable called `html` to store the parsed `r.content`. To parse the HTML, you simply need to call the BeautifulSoup class and pass in the content and `'html.parser'` as arguments. Try printing it out!

```python
html = BeautifulSoup(r.content, 'html.parser')
print(html.prettify())
```

It's an optional step, but you can also add `prettify()`. It's a method that comes with BeautifulSoup, and it makes the HTML more understandable by adding indentation and presenting the data in a human-readable format.

### Extracting the data

As you learned earlier, all the data we need can be found in the **article** elements with the **product_pod** class. To make things easier, you can modify the code to collect data only from these elements. This way, you won't need to parse all of the site's HTML whenever you want to get data about a book. To do so, use one of BeautifulSoup's methods called **find_all()**; it will find all instances of the specified content.

Find and assign all **article** elements with the **product_pod** class to a variable. Let's call it `all_books`. Search the `html` variable that holds the parsed HTML using the `find_all()` method. As arguments, pass in two attributes: **article**, which is the tag of the content, and the class **product_pod**. Note that because `class` is a Python keyword, it can't be used as an argument name, so you need to add a trailing underscore. Here's how it should look:

```python
all_books = html.find_all('article', class_='product_pod')
```

If you print out `all_books`, you'll see it contains a list of all the **product_pod** articles found on the page.

You've successfully narrowed down the HTML to just what you need. Now you can start scraping data about the books. Because **all_books** is a list containing all the necessary information about each book on the page, you'll need to cycle through it using a `for` loop:

```python
for book in all_books:
    # Do something
```

`book` is a variable that you'll call to get specific information in each loop. You can name it however you want, but in our case, `book` makes the most sense, as you're working with book information in each iteration of the loop over the `all_books` list. Remember that we want to find the title, price, availability, description, and link of each book. Let's get started!

When inspecting the site, you can see that the **title** is located as an attribute within the **a** element under **h3**, which is the only h3 in the **product_pod** scope you're working with.

<p align="center">
<img src="https://i.imgur.com/odjbJLJ.png" alt="Inspecting the title">
</p>

BeautifulSoup allows you to find a specific element very easily, just by specifying the HTML tags. To extract the title from the article, write the following:
</p>
189+
```python
190+
title = book.h3.a['title']
191+
```
156192

157-
To find out if the element is in stock, we need to do the same thing we did with the price, simply specify a different paragraph. That would be the one containing the **instock availability** class. If you were to print out the availability just like that, you’d see a lot of blank lines. It’s just the way the site’s HTML is styled. To combat that, we can use a simple Python method called **strip()**, which will remove any blank spaces or lines from the string. If you’ve done everything correctly, it should look like this:
193+
```book``` is the current iteration of the **product_pod** article, so simply write a path to the information from the parent element ```.h3``` to the ```.a``` to specify which component's data you want to assign to the title. You can also add ```.text``` to get the only the text value of the ```book.h3.a.``` – which is also the title, but longer titles are incomplete and have "..." at the end for styling purposes. Instead, we need to get the value of the title attribute, which can be done by adding ```'title'``` in the square brackets.
158194

159-
<p align="center">
160-
<img src="https://i.imgur.com/8cKQuyN.png" alt="Assigning the availability" width="600" height="50">
161-
</p>
195+
If you ```print(title)``` in the loop, you'll see that you've successfully extracted all of the titles of the books on the page. This is what the result should look like:
162196

163-
Furthermore, we need to get the description of the book. The problem is that it’s located on another page dedicated to the specific book. First, we need to get the link to the said book and make another HTTP request to retrieve the description. While inspecting, you’ll see that the link occupies the same place as the title. You can create a new variable, copy the command you used for the title, and just change the value in the square brackets to ‘href’, as that’s what we’re looking for there.
197+
```
198+
A Light in the Attic
199+
Tipping the Velvet
200+
Soumission
201+
Sharp Objects
202+
Sapiens: A Brief History of Humankind
203+
The Requiem Red
204+
The Dirty Little Secrets of Getting Your Dream Job
205+
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
206+
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
207+
The Black Maria
208+
Starving Hearts (Triangular Trade Trilogy, #1)
209+
Shakespeare's Sonnets
210+
Set Me Free
211+
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
212+
Rip it Up and Start Again
213+
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
214+
Olio
215+
Mesaerion: The Best Science Fiction Stories 1800-1849
216+
Libertarianism for Beginners
217+
It's Only the Himalayas
218+
```
164219

165-
<p align="center">
166-
<img src="https://i.imgur.com/iD0YDZO.png" alt="Assigning the link to a book" width="600" height="50">
167-
</p>
220+
Some objects are not as easy to extract. They may be located in elements nested deep in other elements and containers. In such cases, using the ```find()``` method is easier. It's very similar to the **find_all()** method but only returns the first element found. You'll want to find the ```p``` with the **price_color** class and extract its text to see the price.
168221

169-
But, if you print out **link_to_book**, you’ll see that it contains only a part of the link - the location of where the book can be found on the site, but not the domain. One easy way to solve this is to assign the website’s domain link to a variable and just add the **link_to_book**, like this:
222+
```python
223+
price = book.find('p', class_='price_color').text
224+
```
170225

171-
<p align="center">
172-
<img src="https://i.imgur.com/5XqNMqC.png" alt="Link variable" width="600" height="50">
173-
</p>
174-
Boom! Now you have the complete link, which we can use to extract the book’s description.
226+
Next, to find out if the book is in stock, you'll need to do the same thing you did with the price but specify a different paragraph with the **instock availability** class. If you were to print out the availability just like that, you'd see a lot of empty lines because that's how the site's HTML is styled. To fix it, you can use a simple Python method called ```strip()``` to remove any blank spaces or lines from the string. The entire line should look like this:
175227

176-
To get the description, we need to make another request inside the **for** loop, so we get one for each of the books. Basically, we need to do the same thing we did in the beginning: send a GET request to the link and parse the HTML response with BeautifulSoup.
228+
```python
229+
availability = book.find('p', class_ ='instock availability').text.strip()
230+
```
177231

178-
<p align="center">
179-
<img src="https://i.imgur.com/TmXYIKR.png" alt="Second request" width="600" height="50">
180-
</p>
232+
Lastly, let's get the description of the book. The problem is that it's located on a separate page dedicated to the specific book. To solve this, you'll need to get the link to the said book and make another HTTP request to retrieve the description. If you check the HTML, you'll see that the link is located in the same place as the title. You can create a new variable, copy the command you used for the title, and change the value in the square brackets to ```'href'```.
181233

182-
When inspecting the HTML of a book’s page, we can see that the description is just plain text stored in a paragraph. However, this paragraph is not the first in the **product_page** article and does not have a specified class. If we just try to use **find()** without any additional parameters, it will return the price because it’s the value located in the very first paragraph.
234+
```python
235+
link_to_book = book.h3.a['href']
236+
```
183237

184-
<p align="center">
185-
<img src="https://i.imgur.com/b0SdKSD.png" alt="Inspecting the product description" width="600" height="500">
186-
</p>
238+
If you print the ```link_to_book```, you'll see that it contains only a part of the link – the location of where the book can be found on the site, but not the domain. An easy way to solve this is to assign the website's domain link to a variable and add the ```link_to_book``` as a replacement for the placeholder ```{0}``` like this:
187239

188-
In such a case, when using the **find()** method, we need to state that the paragraph we’re looking for has no class (no sass intended). We can do so by specifying that the **class_** equals none.
240+
```python
241+
link = "http://books.toscrape.com/{0}".format(link_to_book)
242+
```
243+
Now, you have the complete link, which you can use to extract the book's description.
244+
You'll need to make another request inside the ```for``` loop to get it. Simply put, you need to do the same thing as in the beginning: send a GET request to the link and parse the HTML response with BeautifulSoup.
189245

190-
<p align="center">
191-
<img src="https://i.imgur.com/WWFltGp.png" alt="Assigning the description" width="600" height="50">
192-
</p>
193-
And, of course, because we just want to get the text value, we add **.text** at the very end
246+
```python
247+
r2 = requests.get(link, proxies=proxy)
248+
html2 = BeautifulSoup(r2.content, 'html.parser')
249+
```
194250

195-
That’s it! We’ve gathered all the information that we needed. We can now print it all out and check what we’ve got. Just a quick note: because the description might be quite long, you can trim it by adding **[:x]**, where x = number of characters you want to print. Some Python tricks for you!
251+
When inspecting the HTML of a book's page, you can see that the description is just plain text stored in a paragraph. However, this paragraph is not the first in the article with a **product_page** class that doesn't have a specified class. If you try to use the ```find()``` method without additional parameters, it will return the price because it's the value in the first paragraph.
196252

197253
<p align="center">
198-
<img src="https://i.imgur.com/rwnjf5X.png" alt="Printing the variables" width="600" height="200">
254+
<img src="https://i.imgur.com/b0SdKSD.png" alt="Inspecting the product description">
199255
</p>
200256

201-
And the response we get, which is just beautiful:
257+
In this case, when using the ```find()``` method, you must state that the paragraph you're looking for has no class (no sass intended). We can do so by specifying that the ```class_``` equals none. Because you just want to get the text value, add ```.text``` at the very end:
258+
259+
```python
260+
description = html2.find('p', class_='').text
261+
```
262+
263+
That's it! You've gathered all the information that you need. You can now print all of it and check the result. Just a quick note: because the description might be long, you can trim it by adding ```[:x]```, where x = number of characters that the description can't exceed.
264+
265+
```python
266+
print(title)
267+
print(price)
268+
print(availability)
269+
print("{0}...".format(description[:150]))
270+
print(link)
271+
print()
272+
```
273+
274+
This is how every book's information will be represented:
275+
276+
```
277+
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
278+
£52.29
279+
In stock
280+
Scott Pilgrim's life is totally sweet. He's 23 years old, he's in a rockband, he's "between jobs" and he's dating a cute high school girl. Nothing cou...
281+
http://books.toscrape.com/catalogue/scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html
282+
```
202283

203-
<p align="center">
204-
<img src="https://i.imgur.com/tsKfXni.png" alt="All variable output" width="800" height="650">
205-
</p>
206284

207285
## Conclusion
208286

209-
To conclude, I would just like to note that there really are a thousand ways to get the data you need by using different functions, loops, and so on. But we sure hope that by the end of this article, you have a better idea of what, when, and how to scrape, and do it with **proxies**!
287+
That's it! It's important to note that there are a thousand ways to get the data you need by using different functions, loops, etc. The one we've explored here is just one of many, and it can be done in many different approaches. Many challenges can arise when dealing with complex HTML structures.
288+
289+
In this article, you've learned how to write a simple scraper script to get information from a website. While the website is just an example that doesn't employ anti-scraping measures, you've also implemented proxies, ensuring that every request was made anonymously and safely. This knowledge will be beneficial when scraping real websites with actual, valuable data.
290+
291+
292+
## Contact
293+
If you need any help or get stuck, feel free to contact us using one of the methods provided:
294+
295+
Email - sales@smartproxy.com
296+
297+
<a href="https://direct.lc.chat/12092754/">Live chat 24/7</a>

0 commit comments

Comments
 (0)
Please sign in to comment.