Web scraping with Python

If you need data that's trapped on a website, writing some code to scrape the page could be your solution. This entry-level class will show you how to use the Python programming language to harvest information from websites into a spreadsheet. We'll introduce you to the command line and show you how to write enough code to fetch and parse web content.

Workshop prerequisites: This class is programming for beginners. Some basic familiarity with Python and HTML is helpful but not required.

Class outline

🐍 Python basics (45 minutes to 1 hour)
💧 Water break! (10 minutes)
🔣 HTML basics (15 minutes)
🛠 Scraping the web (Remaining time)

You will learn...

Some Python basics
- Data types: String, numeric, and Boolean types
- Data structures: Lists and dictionaries
- Control flow: if... else statements
- Iteration: for... in statements
- Functions: Reusable bits of code
How to write and run Python code using Jupyter Notebooks
- Retrieve web content with requests
- Parse meaningful information from raw HTML with beautifulsoup4
- Output tabular data with csv
How to inspect source code in your browser
How to go about getting unstuck
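As a rough preview of how the Python basics listed above fit together, here is a minimal sketch that uses only the standard library. The meeting data, the `describe_size` and `to_csv` helpers, and the 20-attendee threshold are all hypothetical, made up for illustration. The sketch skips fetching with requests and parsing with beautifulsoup4 and focuses on data types, lists and dictionaries, if... else, for... in, functions, and csv output.

```python
import csv
import io

# Hypothetical sample data: a list of dictionaries, one per meeting.
# Each dictionary holds a string, a number, and a Boolean.
meetings = [
    {"name": "City Council", "attendees": 50, "is_public": True},
    {"name": "Budget Committee", "attendees": 12, "is_public": True},
    {"name": "Executive Session", "attendees": 5, "is_public": False},
]


def describe_size(attendees):
    """Return a label for a meeting based on its attendee count."""
    # The threshold of 20 is an arbitrary choice for this example.
    if attendees >= 20:
        return "large"
    else:
        return "small"


def to_csv(rows):
    """Write a list of dictionaries to CSV text and return it as a string."""
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=["name", "size"], lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return output.getvalue()


# Loop over the list, keep only the public meetings, and build new rows.
public_rows = []
for meeting in meetings:
    if meeting["is_public"]:
        public_rows.append(
            {"name": meeting["name"], "size": describe_size(meeting["attendees"])}
        )

print(to_csv(public_rows))
```

To write a real file instead of printing, you could pass a file opened with `open("meetings.csv", "w", newline="")` to `csv.DictWriter` in place of the `io.StringIO` object.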

Next steps

Looking to expand on what you've done in this workshop? Here are some new adventures:

Install Python on your own machine and learn how to manage Python dependencies
Learn how to run your scripts from the command line
- 💡 Check out this tutorial to review the scraping concepts covered in this class and learn the basics of the command line
Keep writing simple scrapers!
- 💡 For inspiration, check out City Scrapers, a collection of scrapers that gather information on public meetings, written by 60+ contributors of all skill levels
Learn more precise HTML parsing approaches, e.g., lxml and xpath
Graduate to more complicated scraping tasks, e.g., scrapes that rely on state
- 💡 For inspiration, check out python-legistar-scraper, a Python library for scraping legislative data from the Legistar web interface and API
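For a taste of the XPath approach mentioned in the next steps above, here is a small sketch. It is not lxml itself: to avoid a third-party install, it uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath and expects well-formed markup. The `PAGE` snippet and its `id` and `class` values are hypothetical. With lxml, which offers an ElementTree-compatible API plus fuller XPath support, a similar query should look much the same.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a scraped page.
PAGE = """
<html>
  <body>
    <table id="meetings">
      <tr class="meeting"><td>City Council</td><td>2024-01-10</td></tr>
      <tr class="meeting"><td>Budget Committee</td><td>2024-01-17</td></tr>
      <tr class="note"><td>Schedule subject to change</td></tr>
    </table>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Select only rows with class="meeting" inside the table with id="meetings".
# The path expression skips the "note" row without any extra if statements.
rows = root.findall(".//table[@id='meetings']/tr[@class='meeting']")

meetings = []
for row in rows:
    # Each matching row has exactly two cells: a name and a date.
    name, date = [cell.text for cell in row.findall("td")]
    meetings.append({"name": name, "date": date})

print(meetings)
```

The appeal of this style is that the path expression describes exactly which elements you want, so the filtering happens in the query rather than in your loop.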

Credits

The content for this course was cribbed heavily from IRE's one-hour course on web scraping with Python.

Some copy in the HTML basics section was lifted from the canonical (to me) First web scraper tutorial, also developed for IRE. When you're ready to move from Jupyter Notebooks to the command line, I'd strongly recommend starting with that tutorial!

Who am I?

👋 I'm Hannah! I apply my journalism background to civic technology projects as a Lead Developer at DataMade. These include:

Writing a web driver that fills out a branching, stateful web form, making it easier to complete a prerequisite for doing business with or receiving funds from the City of Chicago
Maintaining an inter-system scrape, transform, load (ETL) pipeline for legislative data
Managing millions of payroll and pension records to power the Illinois Public Salaries Database and the Illinois Public Pensions Database
