
Title slide: “Data Jam 2021: Web Scraping with BeautifulSoup,” with the Penn Libraries logo and the Research Data and Digital Scholarship Data Jam logo (a cat in a jam jar), licensed CC-BY 4.0

During the first week of the Research Data and Digital Scholarship Data Jam 2021, we discussed “Sourcing the Data” by “Scraping Open Data from the Web”.

Here are the workshop materials, including the slides and a Python code exercise.

Prerequisites

The “Web Scraping with BeautifulSoup” workshop assumes that attendees have some knowledge of HTML/CSS and Python. The required software includes Jupyter Notebook, the pip package installer, and the Python modules sys, urllib.request, and bs4 (BeautifulSoup4).

Web scraping, or crawling, is the process of fetching data from a third-party website by downloading and parsing its HTML code. In this post we use Python's urllib.request module and the Requests library to download web pages, and the BeautifulSoup library (bs4) to parse the HTML or XML and extract the relevant parts of each page.
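
As a minimal sketch of that workflow (using the Requests library and, for illustration, the Wikipedia page scraped later in this post), you can download a page and print its title in a few lines:

    import requests
    from bs4 import BeautifulSoup
    
    # Download the web page
    page = requests.get('https://en.wikipedia.org/wiki/Fruit_preserves')
    
    # Parse the HTML and print the contents of the <title> tag
    soup = BeautifulSoup(page.text, 'html.parser')
    print(soup.title.get_text())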

Farming the HTML Tags

The secret to scraping a web page lies in the ingredients: the web page being scraped, the browser's inspect developer tool, the tags and tag hierarchy of the exact section of the page you want, and finally, the Python script.

Structure of a Regular Web Page

Before we can do web scraping, we need to understand the structure of the web page we're working with and then extract parts of that structure.

<html>
	<head>
		<title>
		</title>
	</head>
	
	<body>
		<p>
		</p>
	</body>
	
</html>
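
To see how this skeleton maps onto BeautifulSoup objects, here is a small sketch (the one-line HTML string is just an illustration) that parses the skeleton above and walks down to the title and the paragraph:

    from bs4 import BeautifulSoup
    
    # A tiny HTML document matching the skeleton above (illustrative only)
    html = '<html><head><title>Jam</title></head><body><p>Fruit preserves</p></body></html>'
    
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.title.get_text())   # the <title> inside <head>
    print(soup.body.p.get_text())  # the <p> inside <body>
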
Python Demo
  1. Download the latest versions of Python and Anaconda3 for your operating system.
  2. Open the Anaconda Navigator and select Jupyter Notebook.
  3. Navigate to the directory where you want to keep your files. Create or upload a notebook, marked by the .ipynb file extension.
  4. Install the BeautifulSoup4 package bs4.
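
If bs4 is not already available, one way to install it from a Jupyter Notebook cell (assuming pip is set up in your environment) is:

    # Install the BeautifulSoup4 package from within a notebook cell
    !pip install beautifulsoup4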

    # Import the bs4 parsing library (aliased here as BeautifulSoup)
    # and urllib.request for downloading web pages
    import bs4 as BeautifulSoup
    import urllib.request

Scenario 1: I want to know more about storing fruits for the winter, using Wikipedia to collect the data.

For this workshop we will be using the Wikipedia web page on Fruit preserves.

  1. Inspect the webpage by right-clicking on the required data or pressing the F12 key
Screenshot of the Wikipedia fruit jam page with the inspect developer tool open
  2. Get the link

    # Fetching the content from the Wikipedia URL
    get_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Fruit_preserves')
    
    read_page = get_data.read()
  3. Parse the text data

    # Parsing the Wikipedia URL content and storing the page text
    parse_page = BeautifulSoup.BeautifulSoup(read_page,'html.parser')
    
    # Returning all the <p> paragraph tags
    paragraphs = parse_page.find_all('p')
    page_content = ''
  4. Add the text to a string and print the prettified HTML

    # Looping through each of the paragraphs and adding them to the variable
    for p in paragraphs:  
        page_content += p.text
                        
        # Make the paragraph tags readable
        print(p.prettify())
  5. Add the text to a string and print the plain paragraph text

    for p in paragraphs:  
        page_content += p.text
        print(p.get_text())
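
If you want to keep what you scraped, a short follow-up sketch (the file name winter_fruits.txt is just an illustration) writes the collected page_content string to a plain-text file:

    # Save the collected paragraph text to a file for later reading
    with open('winter_fruits.txt', 'w', encoding='utf-8') as outfile:
        outfile.write(page_content)
    
    print(len(page_content), 'characters saved')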

Scenario 2: I want to collect all of the European paintings with fruit in them from the Philadelphia Museum of Art's collections.

  1. Understand the webpage data
Screenshot of Philadelphia Museum of Art website search results for "fruit" with the developer tools open
  • Identify the link tag (an <a> tag with an href attribute)
Screenshot of Philadelphia Museum of Art website artwork links for "fruit" with the developer tools open
  2. Import the libraries and fetch the webpage

    import requests
    from bs4 import BeautifulSoup
    
    page = requests.get('https://philamuseum.org/collections/results.html?searchTxt=fruit&searchNameID=&searchClassID=&searchOrigin=&searchDeptID=5&keySearch2=&page=1')
    
    soup = BeautifulSoup(page.text, 'html.parser')
  3. Select the HTML tags

    # Locating the HTML tags for the hyperlinks
    art = soup.find(class_='pinch')
    art_objects = art.find_all('a')
    
    for artwork in art_objects:
        links = artwork.get('href')
        print(links)
  4. Export to a CSV (comma-separated values) table

    import csv
    import requests
    from bs4 import BeautifulSoup
    
    page = requests.get('https://philamuseum.org/collections/results.html?searchTxt=fruit&searchNameID=&searchClassID=&searchOrigin=&searchDeptID=5&keySearch2=&page=1')
    
    soup = BeautifulSoup(page.text, 'html.parser')
    
    # Locating the HTML tags for the hyperlinks
    art = soup.find(class_='pinch')
    art_objects = art.find_all('a')
    
    # Open a csv file to write the output in
    with open('pma.csv', 'w', newline='') as csv_file:
        f = csv.writer(csv_file)
        
        for artwork in art_objects:
            # Prepend the museum domain to turn each relative href into a full link
            links = 'https://philamuseum.org' + artwork.get('href')
            
            # Insert each iteration's output into the csv file
            f.writerow([links])
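
As a quick sanity check, a small sketch can read the exported pma.csv file back and print the first few rows:

    import csv
    
    # Read the exported links back and print the first five rows
    with open('pma.csv', newline='') as csv_file:
        for i, row in enumerate(csv.reader(csv_file)):
            print(row[0])
            if i >= 4:
                break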

Why Scrape Data?

  • Open Data Collection from obsolete or expired websites
  • Open Data Access in the absence of an API (Application Programming Interface)
  • Automated real-time data harvesting
  • Data Aggregation
  • Data Monitoring

Challenges

  • Effective web scraping requires script customization, error handling, data cleaning, and storage of results, which makes the process time- and resource-intensive.
  • Websites are dynamic and always in development. When a website's content changes, so do the tags your script selects. This is especially important to consider when scraping real-time data.
  • Updates to Python and its packages can leave previously working scripts unstable.
  • It is extremely important to throttle your requests, or cap them at a specific rate, so that you do not overload the server (a minimal sketch follows this list).
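
As an illustration of that last point, here is a minimal throttling sketch (the two-second delay and the list of page URLs are assumptions for the example):

    import time
    import requests
    
    # Hypothetical list of result pages to scrape politely
    urls = [
        'https://philamuseum.org/collections/results.html?searchTxt=fruit&searchDeptID=5&page=1',
        'https://philamuseum.org/collections/results.html?searchTxt=fruit&searchDeptID=5&page=2',
    ]
    
    for url in urls:
        page = requests.get(url)
        print(url, page.status_code)
        
        # Pause between requests so the server is not overloaded
        time.sleep(2)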

Spoonfuls of Data Help the Servers NOT Go Down

Here are resources for learning more about the University of Pennsylvania Libraries' policy on server usage:

Alternatives

  • Scrapy – Framework for large-scale web scraping
  • Selenium – Browser automation library, useful for scraping JavaScript-heavy pages
  • rvest – R package for web scraping, including multi-page scraping
  • API (Application Programming Interface) – Direct data requests from the source
  • DOM (Document Object Model) parsing
  • Pattern matching text – Using regex (regular expressions) for selecting text (a small sketch follows this list)
  • Manually copying and pasting
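
For the regex option, here is a small sketch (the example sentence and the "<word> jam" pattern are assumptions for illustration; in practice the text could be the page_content string from Scenario 1):

    import re
    
    # Example text to search
    text = 'Strawberry jam, apricot jam, and orange marmalade are all fruit preserves.'
    
    # Find every "<word> jam" phrase in the text
    matches = re.findall(r'\w+ jam\b', text)
    print(matches)  # ['Strawberry jam', 'apricot jam']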

Resources