Commit 928c949: web scraping content
LukeSkywaler97 committed May 17, 2022 (1 parent: 6ae7b7a)
Showing 11 changed files with 6,230 additions and 0 deletions.
@@ -0,0 +1,44 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md"
}
},
"source": [
"# Web Scraping using Beautiful Soup\n",
"\n",
"* Problem Statement\n",
"* Installing Pre-requisites\n",
"* Overview of BeautifulSoup\n",
"* Getting HTML Content\n",
"* Processing HTML Content\n",
"* Creating Data Frame\n",
"* Processing Data using Data Frame APIs"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
@@ -0,0 +1,169 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installing Pre-requisites\n",
"\n",
"Let us install prerequisites to take care of web scraping using Python. We will use Python libraries such as `requests`, `BeautifulSoup` and optionally `pandas` to perform Web Scraping and then process the data.\n",
"* Library to get the content from HTML Pages `requests`\n",
"* Process HTML Tags and extract data using APIs provided by`beautifulsoup4`\n",
"* Once the data is scraped from HTML pages we can process it by using `pandas` Data Frame APIs. Alternatively, we can also use native collections and associated libraries to process the scraped data.\n",
"\n",
"```shell\n",
"pip install beautifulsoup4\n",
"pip install pandas\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [
{
"data": {
"text/html": [
"<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/hj6yg_8fLpo?rel=0&amp;controls=1&amp;showinfo=0\" frameborder=\"0\" allowfullscreen></iframe>\n"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%HTML\n",
"<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/hj6yg_8fLpo?rel=0&amp;controls=1&amp;showinfo=0\" frameborder=\"0\" allowfullscreen></iframe>"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Name: beautifulsoup4\n",
"Version: 4.9.3\n",
"Summary: Screen-scraping library\n",
"Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/\n",
"Author: Leonard Richardson\n",
"Author-email: [email protected]\n",
"License: MIT\n",
"Location: /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages\n",
"Requires: soupsieve\n",
"Required-by: sphinx-book-theme\n"
]
}
],
"source": [
"!pip show beautifulsoup4"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Name: pandas\n",
"Version: 1.1.5\n",
"Summary: Powerful data structures for data analysis, time series, and statistics\n",
"Home-page: https://pandas.pydata.org\n",
"Author: None\n",
"Author-email: None\n",
"License: BSD\n",
"Location: /home/itversity/.local/lib/python3.6/site-packages\n",
"Requires: pytz, numpy, python-dateutil\n",
"Required-by: beakerx\n"
]
}
],
"source": [
"!pip show pandas"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Defaulting to user installation because normal site-packages is not writeable\n",
"Requirement already satisfied: beautifulsoup4==4.9.3 in /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages (4.9.3)\n",
"Requirement already satisfied: soupsieve>1.2; python_version >= \"3.0\" in /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages (from beautifulsoup4==4.9.3) (2.1)\n"
]
}
],
"source": [
"!pip install beautifulsoup4==4.9.3"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Defaulting to user installation because normal site-packages is not writeable\n",
"Requirement already satisfied: pandas==1.1.5 in /home/itversity/.local/lib/python3.6/site-packages (1.1.5)\n",
"Requirement already satisfied: pytz>=2017.2 in /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages (from pandas==1.1.5) (2020.4)\n",
"Requirement already satisfied: python-dateutil>=2.7.3 in /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages (from pandas==1.1.5) (2.8.1)\n",
"Requirement already satisfied: numpy>=1.15.4 in /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages (from pandas==1.1.5) (1.19.4)\n",
"Requirement already satisfied: six>=1.5 in /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages (from python-dateutil>=2.7.3->pandas==1.1.5) (1.15.0)\n"
]
}
],
"source": [
"!pip install pandas==1.1.5"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
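The notebook above installs the prerequisites; the workflow they support (get HTML content, process it with BeautifulSoup, load the results into a Data Frame) can be sketched as follows. This is a minimal illustration, not the notebook's own code: the HTML table below is a made-up inline snippet so the example is self-contained, whereas in practice the page would be fetched with `requests.get(url).text` first.

```python
# Sketch of the scrape -> parse -> Data Frame workflow.
# The HTML below is a hypothetical stand-in for a page fetched via requests.
from bs4 import BeautifulSoup
import pandas as pd

html = """
<table id="courses">
  <tr><th>Course</th><th>Duration</th></tr>
  <tr><td>Python</td><td>4 weeks</td></tr>
  <tr><td>SQL</td><td>2 weeks</td></tr>
</table>
"""

# Parse the markup; "html.parser" is the built-in parser, so no extra install.
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# First row holds the header cells; the remaining rows hold the data cells.
headers = [th.get_text() for th in rows[0].find_all("th")]
data = [[td.get_text() for td in row.find_all("td")] for row in rows[1:]]

# Hand the extracted lists to pandas for further processing.
df = pd.DataFrame(data, columns=headers)
print(df)
```

From here the usual Data Frame APIs (filtering, grouping, `to_csv`, and so on) apply to the scraped data, which is the approach the outline's "Processing Data using Data Frame APIs" section refers to.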