Explore scraping with BeautifulSoup!

Last update: Oct 05, 2022

beautifulsoup-scrape

Part One: Start from Shakespeare

My professor is a poet (yes, and he also teaches me data and databases), so he loves to give us assignments related to literature.

The starter project with BeautifulSoup is scraping the first act of William Shakespeare's The Tempest.

My notebook is shakespeare-scrape.ipynb.

The code includes:

cook a soup doc, that is, download the HTML text from a webpage and parse it
search for a certain element like div/p/ul, or a certain attribute like class
locate a certain element relative to another with .parent or .find_next_sibling()
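
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: the HTML snippet, its class names, and the variable names are made up so the example runs without a network connection. A real run would download the page first (for example with requests) and pass the text to BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page (assumed markup, not the
# real structure of the page scraped in the notebook).
html = """
<div class="act">
  <h3>ACT I</h3>
  <p class="speaker">Master</p>
  <p class="line">Boatswain!</p>
  <p class="speaker">Boatswain</p>
  <p class="line">Here, master: what cheer?</p>
</div>
"""

# "Cook" the soup doc from the HTML text.
soup = BeautifulSoup(html, "html.parser")

# Search by element name, and by attribute (class_ avoids the Python keyword).
heading = soup.find("h3")
speakers = soup.find_all("p", class_="speaker")

# Move around the tree relative to an element that was already found.
container = heading.parent                       # the enclosing <div>
first_line = speakers[0].find_next_sibling("p")  # the next <p> after the speaker

print(heading.text)
print([s.text for s in speakers])
print(container["class"])
print(first_line.text)
```

Note that find() returns a single element (or None), while find_all() returns a list, so only the latter can be looped over directly.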

Part Two: Develop with Supreme Court Decisions

In this case, I scrape the 2020 Supreme Court Decisions.

The notebook is guardian-and-supreme-court.ipynb.

The code includes:

use a for loop to print each element in a list
find the link hidden in an attribute (such as href)
save the output in a list of lists, even a three-level nested list
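
A rough sketch of those three ideas together, under stated assumptions: the markup, case names, paths, and dates below are placeholders invented for illustration, not data from the Supreme Court site, and the variable names are not taken from the notebook.

```python
from bs4 import BeautifulSoup

# Placeholder markup (assumed); the real page is structured differently.
html = """
<ul>
  <li><a href="/opinions/case-a.pdf">Case A v. Example</a> <span class="date">1/11/21</span></li>
  <li><a href="/opinions/case-b.pdf">Case B v. Example</a> <span class="date">2/25/21</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []  # list of lists: one inner list per decision
for item in soup.find_all("li"):
    link = item.find("a")
    date = item.find("span", class_="date")
    # The URL is not in the visible text; it lives in the href attribute.
    rows.append([link.text, link["href"], date.text])

for row in rows:
    print(row)
```

Indexing a tag like a dictionary (link["href"]) raises KeyError when the attribute is missing; link.get("href") returns None instead, which can be safer on messy pages.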

Part Three: More practice with The Guardian

The webpage I scrape is the Best Non-Fiction Books of All Time listed by The Guardian.

The notebook is the same one as for Part Two!

You will find a surprise if you get the soup doc of that website. Yes! An advertisement hidden in the HTML!

The code is similar to the last project, but there is more:

list comprehension
list of liiiissssst
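
As a small sketch of how a list comprehension can build a list of lists in one expression: the markup, class names, titles, and authors here are all placeholders (assumptions), with one fake ad block included to show how filtering by class skips it.

```python
from bs4 import BeautifulSoup

# Placeholder markup (assumed), including an ad block mixed in with the books.
html = """
<div class="book"><h2>Title One</h2><p class="author">Author One</p></div>
<div class="ad"><p>Sponsored content</p></div>
<div class="book"><h2>Title Two</h2><p class="author">Author Two</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# One inner [title, author] list per book; the ad <div> is never matched
# because find_all only returns elements with class "book".
books = [
    [div.find("h2").text, div.find("p", class_="author").text]
    for div in soup.find_all("div", class_="book")
]
print(books)
```

The comprehension does the same work as a for loop with append, just more compactly.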

Bonus: More Real Shakespeare

In this case, I try to pull out the first 100 lines of Twelfth Night, available here.

The notebook is the same one as for Part Two!

My professor really does love Shakespeare.

I had trouble with this project for a long time because it required each line to contain:

a code for act.scene.line, along with whether it is a stage direction
the speaker or the last person who spoke prior to the stage direction
a line or stage direction
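
One possible way to build such records is a single pass that remembers the current act, scene, line number, and speaker. This is only a sketch under assumptions, not the approach used in the notebook: the tag names and classes are invented, the stage directions are simplified, and it assumes a stage direction does not advance the line counter (it reuses the number of the last spoken line).

```python
from bs4 import BeautifulSoup

# Assumed, simplified markup; the real page will need different selectors.
html = """
<h3>ACT 1</h3>
<h4>SCENE 1</h4>
<p class="stage">Enter ORSINO and CURIO</p>
<p class="speaker">ORSINO</p>
<p class="line">If music be the food of love, play on;</p>
<p class="line">Give me excess of it, that, surfeiting,</p>
<p class="stage">Music plays</p>
<p class="speaker">CURIO</p>
<p class="line">Will you go hunt, my lord?</p>
"""
soup = BeautifulSoup(html, "html.parser")

act = scene = line_no = 0
speaker = None  # last person who spoke; None before anyone has spoken
records = []

# find_all with a list of names returns the tags in document order.
for tag in soup.find_all(["h3", "h4", "p"]):
    text = tag.get_text(strip=True)
    classes = tag.get("class", [])
    if tag.name == "h3":                      # new act resets scene and line
        act, scene, line_no = int(text.split()[-1]), 0, 0
    elif tag.name == "h4":                    # new scene resets the line count
        scene, line_no = int(text.split()[-1]), 0
    elif "speaker" in classes:
        speaker = text
    elif "line" in classes or "stage" in classes:
        is_stage = "stage" in classes
        if not is_stage:
            line_no += 1                      # only spoken lines are counted
        records.append({
            "code": f"{act}.{scene}.{line_no}",
            "is_stage_direction": is_stage,
            "speaker": speaker,
            "text": text,
        })

first_100 = records[:100]
for record in first_100:
    print(record)
```

Keeping the state in a few plain variables means each record can be filled in as soon as its tag is reached, with no need to look backwards through the tree.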

I figured it out in a very complex way and I believe there is a better way to do it!

Owner

Chuqin

Scrapes mcc-mnc.com and outputs 3 files with the data (JSON, CSV & XLSX)

An application that, given a URL, crawls a web page, gets all its words, and sorts and counts them.

Scrape puzzle scrambles from csTimer.net

🥫 The simple, fast, and modern web scraping library

Python script for crawling ResearchGate.net papers✨⭐️📎

Web Crawlers for Data Labelling of Malicious Domain Detection & IP Reputation Evaluation

A Python crawler that automatically fetches the lyrics of all songs by a given artist on QQ Music

Screenhook is a script that captures an image of a web page and sends it to a Discord webhook.

A Python package that scrapes Google News article data while remaining undetected by Google.

Scraping and visualising India's real-time COVID-19 data from the MOHFW dataset.

Fundamentus scrapy

Instagram profile scraper with Python

A high-level distributed crawling framework.

Nekopoi scraper using python3

Parse feeds in Python

Scrapes the day's announcements from major SRCs | A small tool that sends notifications via WeChat | Bug bounty tool

News, full-text, and article metadata extraction in Python 3. Advanced docs:

A pure-python HTML screen-scraping library

Genshin Impact scraper that captures artifact information from the Genshin Impact interface

Makes downloading from GitHub with git 1000 times faster for users in China!