Data Wrangling with Open Refine

Week 2 of the Data Jam focused on Working with Data. In order to get data ready for visualizations and analysis, researchers engage in data wrangling. Wrangling your data with various tools allows you to restructure information so that it can be easily understood by people and analyzed by machines. In this workshop, we worked with OpenRefine, a powerful tool for working with data. Watch the recording and read through the workshop notes to learn more about tidying your data, transforming it from one format to another, and extending it with various web services. Here are the key takeaways we covered in the workshop: 

1. Data Wrangling Means Many Things

Most people think of the data analysis process like this:  

A slide entitled "The Data Process", with three points on a line: Sourcing Data, Wrangling Data, Visualizing Data.

But data wrangling actually covers a number of activities that happen between collecting data and analyzing it, and it is a much more involved, labor-intensive process than that picture suggests.

Slide entitled "The Data Wrangling Process", with a line and multiple points, from left to right: Discovering, Structuring, Cleaning, Enriching, Validating, and Publishing. Each term includes a description of the term.

Often, these activities involve: 

  • discovering (understanding data, what it contains, and what questions or inquiries are productive) 
  • structuring (changing how data is displayed or stored) 
  • cleaning (modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted)  
  • enriching (combining data with additional data to contextualize or expand analysis) 
  • validating (confirming consistency and quality of data) 
  • publishing (planning and publishing data and previous activities) 

2. Messy Data vs. Tidy Data

We discussed many different ways data can be considered "messy", such as missing values, inconsistent data entry, or unclear structure and organization. Messy data is not necessarily bad or unusable. Messy data can be a good thing! But wrangling messy data makes it easier to use, makes it accessible to you and other people, and contextualizes it for your research questions. So we contrasted “messy data” with “tidy data”, which follows these principles:

  • Each variable forms a column  
  • Each observation forms a row 
  • Each type of observational unit forms a table 
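
To make these principles concrete, here is a minimal Python sketch of reshaping a "wide" table into a tidy one. The rows, column names, and ratings are made up for illustration and are not taken from the workshop data. In the wide version, the thing being rated is hidden in the column headers; in the tidy version, it becomes its own column, and each row is a single observation.

```python
# Hypothetical "wide" survey rows: one column per rated item (made-up values).
wide_rows = [
    {"respondent": 1, "age": 24, "Candy Corn": "MEH", "Twix": "JOY"},
    {"respondent": 2, "age": 31, "Candy Corn": "DESPAIR", "Twix": "JOY"},
]

# Columns that identify the respondent rather than hold a rating (assumed names).
ID_COLUMNS = ("respondent", "age")


def to_tidy(rows, id_columns=ID_COLUMNS):
    """Reshape wide rows into one row per (respondent, item) observation."""
    tidy = []
    for row in rows:
        ids = {key: row[key] for key in id_columns}
        for column, value in row.items():
            if column in id_columns:
                continue
            # The former column header becomes a value in the "candy" column.
            tidy.append({**ids, "candy": column, "rating": value})
    return tidy


tidy_rows = to_tidy(wide_rows)
for tidy_row in tidy_rows:
    print(tidy_row)
```

Strictly speaking, the third principle suggests keeping respondent details such as age in a separate table; they are repeated here only to keep the sketch short.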

If you’re interested in learning more about data cleaning and wrangling, check out: 

3. OpenRefine Tips

In this workshop, we worked with the Candy Hierarchy data from 2017. To collect it, Benjamin R. Cohen and David Ng surveyed BoingBoing readers about their feelings toward more than 70 different types of candy. We used OpenRefine to clean the dataset so that it included only respondents between the ages of 18 and 35 from the United States of America. Cleaning this dataset involved:

  • Exploring the dataset and its structure to understand what "messy data" problems were present
  • Filtering the dataset down to the entries that met our criteria
  • Clustering similar values to ensure consistent spelling across data entries
  • Saving and exporting the data cleaning steps and datasets
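
For readers who think in code, the filtering and clustering steps above can be sketched in Python. This is only an approximation of the idea, not how OpenRefine is implemented: the `fingerprint` function below is a simplified version of key-collision clustering (it skips steps such as normalizing accented characters), and the sample rows, the `US_KEYS` set, and all function names are assumptions made up for this example.

```python
import string
from collections import defaultdict

# Made-up sample rows; free-text survey answers are often this inconsistent.
rows = [
    {"age": "24", "country": "USA"},
    {"age": "31", "country": "usa "},
    {"age": "40", "country": "U.S.A."},
    {"age": "19", "country": "Canada"},
    {"age": "old enough", "country": "United States of America"},
    {"age": "28", "country": "america, united states of"},
]


def fingerprint(value):
    """Simplified clustering key: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group distinct spellings that share a key; keep only groups with variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


def parse_age(text):
    """Return the age as an int, or None when the entry is not a whole number."""
    try:
        return int(text.strip())
    except ValueError:
        return None


# Keys we decided (by reviewing the clusters) mean "United States"; an assumption.
US_KEYS = {"usa", "america of states united"}


def keep(row):
    """True for respondents aged 18 to 35 whose country matches a US key."""
    age = parse_age(row["age"])
    return age is not None and 18 <= age <= 35 and fingerprint(row["country"]) in US_KEYS


clusters = cluster(row["country"] for row in rows)
filtered = [row for row in rows if keep(row)]
print(clusters)
print(filtered)
```

As in OpenRefine, the clusters are suggestions for a person to review: the key groups different spellings of "USA" together, but deciding that "USA" and "United States of America" mean the same thing is still a human judgment.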

4. Next Steps

This workshop was based on the Data Carpentry lesson, Data Cleaning with OpenRefine for Ecologists. OpenRefine lists tutorials and resources for you to explore. (I also recommend looking at Thomas Padilla’s Getting Started with Open Refine.) If you’re interested in trying more advanced data cleaning techniques with OpenRefine, this lesson from Programming Historian focuses on fetching and parsing historical data.  

OpenRefine isn't the only tool for data wrangling! If you’re already familiar with a particular programming language or software, these guides and tutorials may be helpful.  

About the Author

Emily Esten
Arnold and Deanne Kaplan Collection of Early American Judaica Curator of Digital Humanities
As the Kaplan Curator, Emily spearheads projects that facilitate access to and use of Penn's Judaica collections, making connections between them and dispersed Judaica content around the globe. She is responsible for curating the Kaplan Collection of Early American Judaica and for rolling out Scribes of the Cairo Geniza project, phase II.

As the inaugural Kaplan Curator, Emily Esten spearheads projects that facilitate access to and use of Penn's Judaica collections, promoting them and making connections between them and dispersed Judaica content around the globe. She is also responsible for curating, building, and researching the Arnold and Deanne Kaplan Collection of Early American Judaica. In addition, she coordinates Scribes of the Cairo Geniza project.

In addition to her role at the Penn Libraries, she is the Web Manager for Contingent Magazine and the Director of Communications for the National Emerging Museum Professionals Network. Previously, she worked at the Edward M. Kennedy Institute for the United States Senate and at Brown University.

Emily holds a Bachelor of Arts degree, with majors in history and digital humanities, from the University of Massachusetts Amherst and a Master's Degree in public humanities from Brown University.