
Title slide: “Data Jam 2021: Web Scraping with BeautifulSoup,” with the Penn Libraries logo and the Research Data and Digital Scholarship Data Jam logo (a cat in a jam jar), licensed CC-BY 4.0

During the first week of the Research Data and Digital Scholarship Data Jam 2021, we discussed “Sourcing the Data” by “Scraping Open Data from the Web”.

Here are the workshop materials, including the slides and a Python code exercise.

Prerequisites

The “Web Scraping with BeautifulSoup” workshop assumes that attendees have some knowledge of HTML/CSS and Python. The required software includes Jupyter Notebook, the pip package installer, and the Python modules sys, urllib.request, and bs4 (BeautifulSoup4).

Web scraping, or crawling, is the process of fetching data from a third-party website by downloading and parsing its HTML code. In this post we use Python's urllib.request module and the Requests library to download web pages, and the BeautifulSoup library (bs4) to parse the HTML or XML and extract the relevant parts of each page.
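
As a minimal sketch of that workflow (using the Requests library and, for illustration, the Wikipedia page scraped later in this post), you can download a page and print its title in a few lines:

    import requests
    from bs4 import BeautifulSoup
    
    # Download the web page
    page = requests.get('https://en.wikipedia.org/wiki/Fruit_preserves')
    
    # Parse the HTML and print the contents of the <title> tag
    soup = BeautifulSoup(page.text, 'html.parser')
    print(soup.title.get_text())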

Farming the HTML Tags

The secret to scraping a web page lies in the ingredients: the web page being scraped, the browser's inspect developer tool, the tags and tag hierarchy of the exact section of the page you want, and finally, the Python script.

Structure of a Regular Web Page

Before we can do web scraping, we need to understand the structure of the web page we're working with and then extract parts of that structure.

<html>
	<head>
		<title>
		</title>
	</head>
	
	<body>
		<p>
		</p>
	</body>
	
</html>
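
To see how this skeleton maps onto BeautifulSoup objects, here is a small sketch (the one-line HTML string is just an illustration) that parses the skeleton above and walks down to the title and the paragraph:

    from bs4 import BeautifulSoup
    
    # A tiny HTML document matching the skeleton above (illustrative only)
    html = '<html><head><title>Jam</title></head><body><p>Fruit preserves</p></body></html>'
    
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.title.get_text())   # the <title> inside <head>
    print(soup.body.p.get_text())  # the <p> inside <body>
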
Python Demo
  1. Download the latest versions of Python and Anaconda3 for your operating system.
  2. Open the Anaconda Navigator and select Jupyter Notebook.
  3. Navigate to the directory where you want to keep your files. Create or upload a notebook, marked by the .ipynb file extension.
  4. Install the BeautifulSoup4 package bs4.
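
If bs4 is not already available, one way to install it from a Jupyter Notebook cell (assuming pip is set up in your environment) is:

    # Install the BeautifulSoup4 package from within a notebook cell
    !pip install beautifulsoup4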

    # Import the bs4 parsing library (aliased here as BeautifulSoup)
    # and urllib.request for downloading web pages
    import bs4 as BeautifulSoup
    import urllib.request

Scenario 1: I want to know more about storing fruits for the winter, using Wikipedia to collect the data.

For this workshop we will be using the Wikipedia web page on Fruit preserves.

  1. Inspect the webpage by right-clicking on the required data or pressing the F12 key
Screenshot of the Wikipedia fruit jam page with the inspect developer tool open
  2. Get the link

    # Fetching the content from the Wikipedia URL
    get_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Fruit_preserves')
    
    read_page = get_data.read()
  3. Parse the text data

    # Parsing the Wikipedia URL content and storing the page text
    parse_page = BeautifulSoup.BeautifulSoup(read_page,'html.parser')
    
    # Returning all the <p> paragraph tags
    paragraphs = parse_page.find_all('p')
    page_content = ''
  4. Add the text to a string and print the prettified HTML

    # Looping through each of the paragraphs and adding them to the variable
    for p in paragraphs:  
        page_content += p.text
                        
        # Make the paragraph tags readable
        print(p.prettify())
  5. Add the text to a string and print the plain paragraph text

    for p in paragraphs:  
        page_content += p.text
        print(p.get_text())
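
If you want to keep what you scraped, a short follow-up sketch (the file name winter_fruits.txt is just an illustration) writes the collected page_content string to a plain-text file:

    # Save the collected paragraph text to a file for later reading
    with open('winter_fruits.txt', 'w', encoding='utf-8') as outfile:
        outfile.write(page_content)
    
    print(len(page_content), 'characters saved')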

Scenario 2: I want to collect all of the European paintings with fruit in them from the Philadelphia Museum of Art's collections.

  1. Understand the webpage data
Screenshot of Philadelphia Museum of Art website search results for "fruit" with the developer tools open
  • Identify the link tag (an <a> tag with an href attribute)
Screenshot of Philadelphia Museum of Art website artwork links for "fruit" with the developer tools open
  2. Import the libraries and fetch the webpage

    import requests
    from bs4 import BeautifulSoup
    
    page = requests.get('https://philamuseum.org/collections/results.html?searchTxt=fruit&searchNameID=&searchClassID=&searchOrigin=&searchDeptID=5&keySearch2=&page=1')
    
    soup = BeautifulSoup(page.text, 'html.parser')
  3. Select the HTML tags

    # Locating the HTML tags for the hyperlinks
    art = soup.find(class_='pinch')
    art_objects = art.find_all('a')
    
    for artwork in art_objects:
        links = artwork.get('href')
        print(links)
  4. Export to a CSV (comma-separated values) table

    import csv
    import requests
    from bs4 import BeautifulSoup
    
    page = requests.get('https://philamuseum.org/collections/results.html?searchTxt=fruit&searchNameID=&searchClassID=&searchOrigin=&searchDeptID=5&keySearch2=&page=1')
    
    soup = BeautifulSoup(page.text, 'html.parser')
    
    # Locating the HTML tags for the hyperlinks
    art = soup.find(class_='pinch')
    art_objects = art.find_all('a')
    
    # Open a csv file to write the output in
    with open('pma.csv', 'w', newline='') as csv_file:
        f = csv.writer(csv_file)
        
        for artwork in art_objects:
            # Prepend the museum domain to turn each relative href into a full link
            links = 'https://philamuseum.org' + artwork.get('href')
            
            # Insert each iteration's output into the csv file
            f.writerow([links])
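
As a quick sanity check, a small sketch can read the exported pma.csv file back and print the first few rows:

    import csv
    
    # Read the exported links back and print the first five rows
    with open('pma.csv', newline='') as csv_file:
        for i, row in enumerate(csv.reader(csv_file)):
            print(row[0])
            if i >= 4:
                break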

Why Scrape Data?

  • Open Data Collection from obsolete or expired websites
  • Open Data Access in the absence of an API (Application Programming Interface)
  • Automated real-time data harvesting
  • Data Aggregation
  • Data Monitoring

Challenges

  • Effective web scraping requires script customization, error handling, data cleaning, and storage of results, which makes the process time- and resource-intensive.
  • Websites are dynamic and always in development. When a website's content changes, so do the tags your script selects. This is especially important to consider when scraping real-time data.
  • Updates to Python and its packages can leave previously working scripts unstable.
  • It is extremely important to throttle your requests, or cap them at a specific rate, so that you do not overload the server (a minimal sketch follows this list).
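
As an illustration of that last point, here is a minimal throttling sketch (the two-second delay and the list of page URLs are assumptions for the example):

    import time
    import requests
    
    # Hypothetical list of result pages to scrape politely
    urls = [
        'https://philamuseum.org/collections/results.html?searchTxt=fruit&searchDeptID=5&page=1',
        'https://philamuseum.org/collections/results.html?searchTxt=fruit&searchDeptID=5&page=2',
    ]
    
    for url in urls:
        page = requests.get(url)
        print(url, page.status_code)
        
        # Pause between requests so the server is not overloaded
        time.sleep(2)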

Spoonfuls of Data Help the Servers NOT Go Down

Here are resources for learning more about the University of Pennsylvania Libraries' policy on server usage:

Alternatives

  • Scrapy – Framework for large-scale web scraping
  • Selenium – Browser automation library, useful for scraping JavaScript-heavy pages
  • rvest – R package for web scraping, including multi-page scraping
  • API (Application Programming Interface) – Direct data requests from the source
  • DOM (Document Object Model) parsing
  • Pattern matching text – Using regex (regular expressions) for selecting text (a small sketch follows this list)
  • Manually copying and pasting
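
For the regex option, here is a small sketch (the example sentence and the "<word> jam" pattern are assumptions for illustration; in practice the text could be the page_content string from Scenario 1):

    import re
    
    # Example text to search
    text = 'Strawberry jam, apricot jam, and orange marmalade are all fruit preserves.'
    
    # Find every "<word> jam" phrase in the text
    matches = re.findall(r'\w+ jam\b', text)
    print(matches)  # ['Strawberry jam', 'apricot jam']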

Resources