1 parent 6ae7b7a, commit 928c949
Showing 11 changed files with 6,230 additions and 0 deletions.
44 changes: 44 additions & 0 deletions
...hon-and-sql/34_web_scraping_using_beautifulsoup/01_web_scraping_using_beautifulsoup.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"pycharm": { | ||
"name": "#%% md" | ||
} | ||
}, | ||
"source": [ | ||
"# Web Scraping using Beautiful Soup\n", | ||
"\n", | ||
"* Problem Statement\n", | ||
"* Installing Pre-requisites\n", | ||
"* Overview of BeautifulSoup\n", | ||
"* Getting HTML Content\n", | ||
"* Processing HTML Content\n", | ||
"* Creating Data Frame\n", | ||
"* Processing Data using Data Frame APIs" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.6.12" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
169 changes: 169 additions & 0 deletions
.../01-python-and-sql/34_web_scraping_using_beautifulsoup/02_installing_pre-requisites.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,169 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Installing Pre-requisites\n", | ||
"\n", | ||
"Let us install the prerequisites for web scraping using Python. We will use the `requests` and `beautifulsoup4` libraries to scrape web pages and, optionally, `pandas` to process the scraped data.\n", | ||
"* `requests` gets the content of HTML pages.\n", | ||
"* `beautifulsoup4` provides APIs to process HTML tags and extract data.\n", | ||
"* `pandas` Data Frame APIs can process the data once it is scraped from HTML pages. Alternatively, we can use native collections and associated libraries to process the scraped data.\n", | ||
"\n", | ||
"```shell\n", | ||
"pip install requests\n", | ||
"pip install beautifulsoup4\n", | ||
"pip install pandas\n", | ||
"```" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": { | ||
"tags": [ | ||
"remove-cell" | ||
] | ||
}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/html": [ | ||
"<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/hj6yg_8fLpo?rel=0&controls=1&showinfo=0\" frameborder=\"0\" allowfullscreen></iframe>\n" | ||
], | ||
"text/plain": [ | ||
"<IPython.core.display.HTML object>" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
} | ||
], | ||
"source": [ | ||
"%%HTML\n", | ||
"<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/hj6yg_8fLpo?rel=0&controls=1&showinfo=0\" frameborder=\"0\" allowfullscreen></iframe>" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Name: beautifulsoup4\n", | ||
"Version: 4.9.3\n", | ||
"Summary: Screen-scraping library\n", | ||
"Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/\n", | ||
"Author: Leonard Richardson\n", | ||
"Author-email: [email protected]\n", | ||
"License: MIT\n", | ||
"Location: /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages\n", | ||
"Requires: soupsieve\n", | ||
"Required-by: sphinx-book-theme\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"!pip show beautifulsoup4" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Name: pandas\n", | ||
"Version: 1.1.5\n", | ||
"Summary: Powerful data structures for data analysis, time series, and statistics\n", | ||
"Home-page: https://pandas.pydata.org\n", | ||
"Author: None\n", | ||
"Author-email: None\n", | ||
"License: BSD\n", | ||
"Location: /home/itversity/.local/lib/python3.6/site-packages\n", | ||
"Requires: pytz, numpy, python-dateutil\n", | ||
"Required-by: beakerx\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"!pip show pandas" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Defaulting to user installation because normal site-packages is not writeable\n", | ||
"Requirement already satisfied: beautifulsoup4==4.9.3 in /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages (4.9.3)\n", | ||
"Requirement already satisfied: soupsieve>1.2; python_version >= \"3.0\" in /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages (from beautifulsoup4==4.9.3) (2.1)\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"!pip install beautifulsoup4==4.9.3" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Defaulting to user installation because normal site-packages is not writeable\n", | ||
"Requirement already satisfied: pandas==1.1.5 in /home/itversity/.local/lib/python3.6/site-packages (1.1.5)\n", | ||
"Requirement already satisfied: pytz>=2017.2 in /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages (from pandas==1.1.5) (2020.4)\n", | ||
"Requirement already satisfied: python-dateutil>=2.7.3 in /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages (from pandas==1.1.5) (2.8.1)\n", | ||
"Requirement already satisfied: numpy>=1.15.4 in /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages (from pandas==1.1.5) (1.19.4)\n", | ||
"Requirement already satisfied: six>=1.5 in /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages (from python-dateutil>=2.7.3->pandas==1.1.5) (1.15.0)\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"!pip install pandas==1.1.5" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.8.12" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
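
The prerequisites installed in this notebook can be sanity-checked from Python before moving on. The following is a minimal sketch and is not part of the commit: `PREREQUISITES` and `find_missing` are hypothetical names introduced for illustration. It relies only on the standard library and on the fact that the `beautifulsoup4` package is imported under the module name `bs4`.

```python
import importlib.util

# Maps the pip package name to the module name used in `import` statements.
# Note that `beautifulsoup4` is imported as `bs4`.
PREREQUISITES = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "pandas": "pandas",
}


def find_missing(prerequisites):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in prerequisites.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = find_missing(PREREQUISITES)
    if missing:
        # Suggest a single pip command covering everything that is absent.
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All prerequisites are importable.")
```

The check only looks up each module without importing it, so it stays fast even though `pandas` is slow to import. It does not verify versions; `pip show`, as used in the notebook cells, remains the way to confirm those.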