- Download the Python 3 interpreter: https://www.python.org/downloads/
- For convenience, direct download link: python-3.11.5-amd64.exe
- Download the PyCharm Community Edition IDE from the main download page https://www.jetbrains.com/pycharm/download/ (the black section at the bottom of the page, which is free)
- For convenience, direct download link: pycharm
- Create a GitHub account (sign up if you don't already have one):
- Main site: https://github.com/
- Current direct link for sign-up
- Download and install Git locally on your laptop as well: Download Link
Scrapy is an open-source web crawling framework for Python. It facilitates the extraction of data from websites and supports robust, efficient, and flexible scraping. With built-in features like middleware and pipelines, Scrapy provides a comprehensive solution for web scraping tasks.
pip install scrapy
scrapy startproject myscrapyproject
cd myscrapyproject
scrapy genspider myspider https://en.wikipedia.org/wiki/Python_(programming_language)
scrapy crawl myspider
scrapy crawl myspider -o output.json
scrapy crawl myspider -o output.csv
scrapy crawl myspider -o output.xml
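The genspider command creates a spider file under myscrapyproject/spiders/. The sketch below is roughly what the generated myspider.py looks like for the command above (the exact template can differ between Scrapy versions, and the comment is mine):

```python
# myscrapyproject/spiders/myspider.py -- approximately what "scrapy genspider" generates
import scrapy


class MyspiderSpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
        # Fill this in with selectors (see the scrapy shell session below);
        # anything yielded here ends up in output.json / output.csv / output.xml.
        pass
```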
scrapy shell https://en.wikipedia.org/wiki/Python_(programming_language)
>>> response.css('title::text').get()
>>> response.css('#firstHeading > span::text').get()
>>> response.css('#firstHeading').get()
>>> response.css('div#mw-content-text > div.mw-content-ltr.mw-parser-output > p:nth-child(6)').get()
>>> response.css('div#mw-content-text > div.mw-content-ltr.mw-parser-output > p').getall()
>>> response.css('div#mw-content-text > div.mw-content-ltr.mw-parser-output > p').getall()[4]
>>> response.css('div#mw-content-text > div.mw-content-ltr.mw-parser-output > p').getall()[4].strip().replace('\n', '')
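The selectors explored in the shell can be carried back into the spider's parse method. Below is a hedged sketch using the same selectors; the title/heading/paragraphs field names are arbitrary choices (not part of Scrapy), and ::text is used to pull plain text instead of the raw HTML strings returned by .get() above:

```python
# myscrapyproject/spiders/myspider.py -- illustrative parse() built from the shell selectors above
import scrapy


class MyspiderSpider(scrapy.Spider):
    name = "myspider"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
        paragraphs = response.css(
            "div#mw-content-text > div.mw-content-ltr.mw-parser-output > p"
        )
        yield {
            "title": response.css("title::text").get(),
            "heading": response.css("#firstHeading > span::text").get(),
            # join the text nodes of each paragraph instead of keeping raw HTML
            "paragraphs": [
                " ".join(p.css("::text").getall()).strip() for p in paragraphs
            ],
        }
```

Running scrapy crawl myspider -o output.json with this parse method writes the yielded dictionary to output.json.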
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It provides Pythonic idioms for iterating, searching, and modifying the parse tree. Beautiful Soup transforms complex HTML documents into a tree of Python objects, simplifying web scraping tasks by offering intuitive methods to navigate and search the parsed content.
pip install requests beautifulsoup4
pip install -r requirements.txt
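For comparison with the Scrapy shell session above, here is a minimal requests + Beautiful Soup sketch that fetches the same Wikipedia page and prints similar pieces of it. The URL and selectors mirror the ones used earlier; the file name and everything else are illustrative assumptions:

```python
# bs4_example.py -- minimal sketch using requests + Beautiful Soup
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/Python_(programming_language)"

response = requests.get(URL, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Same elements targeted by the Scrapy CSS selectors above
print(soup.title.get_text())
print(soup.select_one("#firstHeading > span").get_text())

# First few content paragraphs, joined as plain text
for p in soup.select("div#mw-content-text div.mw-parser-output > p")[:3]:
    text = p.get_text(" ", strip=True)
    if text:
        print(text)
```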
git init
git remote add origin https://github.com/ahmedredahussien/sprints-webscrapping.git
git add .
git commit -m "Initial commit"
git pull origin master --allow-unrelated-histories
> First-time push:
git push -u origin master
git checkout -b my-feature
> Optional, in case it's a new file:
git add README.md
git commit README.md -m "add git steps to feature branch"
> Pushing commits after the first time:
git push origin my-feature
git checkout master
git merge my-feature
git push origin master
> Normal delete (branch already merged):
git branch -d my-feature
> Force delete (discards unmerged commits):
git branch -D my-feature
git push -u --force origin master
git clone https://github.com/ahmedredahussien/WebScraping.git WebScraping
git reset --hard origin/master
> Direct change to master: if the branch was changed directly on the server (for example, edited through the GitHub web editor), the reset --hard command above discards local changes and makes your local copy match origin/master.