Reconciling Services for Music Metadata in OpenRefine
A tutorial on how to use reconciliation services for Internet Archive and Discogs in OpenRefine.
Introduction
We are working on a large Jewish music project that brings together our disparate and diverse Jewish music collections at Penn Libraries (and beyond) into a database that is accessible for scholars, educators, audiophiles, and the general public.
As part of this work, we want to supplement our data with information that can be found on sites like Discogs and Internet Archive. To pull data from these repositories, we need to use a reconciliation service. This will allow us to fuzzy match, for example, the name of an album in a Penn Libraries database with the album and its associated data in Discogs or Internet Archive. Then we can pull the data to supplement our records.
A Case for Reconciliation
The Robert and Molly Freedman Jewish Sound Archive (referred to hereafter as the Freedman Archive) contains over 4,000 recordings. One of these recordings is Tevya And His Daughters (1957). The Freedman Archive displays the following metadata for Tevya And His Daughters (1957):
You can see that the Freedman Archive record for this album has lots of information, including the album publisher, the number of tracks, the location of publication, etc. However, it lacks some important information that we hope to include in our larger music database, such as track titles (there are three tracks), the date of publication, and more.
Some of the data that the Freedman Archive does not have can be found through Discogs and Internet Archive.
To harvest this data, we created two reconciliation services. These allow us to take a spreadsheet with album titles from the Freedman Archive and match them up to the titles found in Internet Archive and Discogs. We then can pull the data from these sources easily and decide what data will be useful.
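To give a feel for what "fuzzy matching" means here, the sketch below scores an album title against a few candidate titles and ranks them. It is only an illustration: it uses Python's standard-library difflib rather than the fuzzywuzzy package installed later in this tutorial, the candidate titles are invented, and the function names are hypothetical. The actual services may score matches differently.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score, ignoring case and surrounding whitespace."""
    return 100 * SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def best_matches(title, candidates, limit=3):
    """Rank candidate titles by similarity to `title`, best first."""
    scored = [(candidate, similarity(title, candidate)) for candidate in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Invented candidate titles for illustration; not taken from Discogs or Internet Archive.
candidates = [
    "Songs of the Shtetl",
    "Tevya and His Daughters",
    "Fiddler on the Roof",
]

matches = best_matches("Tevya And His Daughters", candidates)
for name, score in matches:
    print(f"{score:6.1f}  {name}")
```

Because the comparison ignores case, the title that differs only in capitalization comes out on top with a perfect score, while unrelated titles score much lower.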
Tutorial for Reconciling in OpenRefine to Discogs and Internet Archive
Note: you must have Python installed and an API personal access token from Discogs (you can get one for free when you make an account). Also, Discogs can be slow due to their rate limiter, so if you have loads of data it might take some time to process.
Creating your Discogs reconciliation service
- First, clone the reconciliation service repository
git clone https://github.com/judaicadh/discogsreconciliation
or
git clone https://github.com/judaicadh/internetarchivereconciliation
This repository is adapted from Michael Stephens's reconcile-demo: https://github.com/mikejs/reconcile-demo
- Next, open Terminal or another CLI and navigate to the discogsreconciliation or internetarchivereconciliation folder you've just cloned.
Steps 3–5 are only for the Discogs reconciliation service
- Open the discogs_reconciliation.py file in a code editor
- Replace 'YOUR_DISCOGS_API_TOKEN' with your actual Discogs API personal access token.
- Save the discogs_reconciliation.py file
- Install the required Python packages if you haven’t already
pip install Flask flask-cors requests fuzzywuzzy
- Run
python discogs_reconciliation.py
or
python internetarchiverecon.py
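Once a service is running, OpenRefine sends it batches of queries and expects JSON back in the shape defined by the Reconciliation Service API: one entry per query key, each holding a "result" list of candidates with an id, name, score, match flag, and type. The sketch below builds such a response from a stand-in lookup function. It is not the code in either repository; `fake_lookup`, the IDs, the scores, and the match threshold are all made up for illustration.

```python
import json


def fake_lookup(query):
    """Stand-in for a real Discogs or Internet Archive search (invented IDs and scores)."""
    return [
        {"id": "example-1", "name": query, "score": 100.0},
        {"id": "example-2", "name": query + " (reissue)", "score": 80.0},
    ]


def build_response(queries, lookup, match_threshold=95.0):
    """Turn a batch of reconciliation queries into the JSON structure OpenRefine reads."""
    response = {}
    for key, query in queries.items():
        limit = query.get("limit", 3)
        results = []
        for candidate in lookup(query["query"])[:limit]:
            results.append({
                "id": candidate["id"],
                "name": candidate["name"],
                "score": candidate["score"],
                # Only high-scoring candidates are flagged as confident matches.
                "match": candidate["score"] >= match_threshold,
                # Hypothetical type; a real service would report its own types.
                "type": [{"id": "release", "name": "Release"}],
            })
        response[key] = {"result": results}
    return response


queries = {"q0": {"query": "Tevya And His Daughters", "limit": 2}}
response = build_response(queries, fake_lookup)
print(json.dumps(response, indent=2))
```

In a real service, the lookup step is where the Discogs or Internet Archive search and the fuzzy scoring would happen.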
Linking your Discogs or Internet Archive reconciliation service to OpenRefine
- Navigate to OpenRefine and open your spreadsheet with the music metadata you hope to reconcile
- Open the menu for the column you wish to reconcile -> click Reconcile -> click Start Reconciling. A smaller window should open.
- Select the "Add standard service..." button at the bottom of the new window
- Enter
http://127.0.0.1:3000/reconcile
(for Discogs) or
http://127.0.0.1:9000/reconcile
(for Internet Archive) as your reconciliation address
- Select "Add service"
- The service should now appear in the list of reconciliation services.
- You can select the service that you just added
- You should see a window similar to the one below. Select the type you want to reconcile against: if you are reconciling artist names, select Artist, and likewise for the other options (Release, Master, Label). The Internet Archive service does not have this feature.
- Then click "Start reconciling..." You'll see your results like you would with any other reconciliation service (see documentation here)
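If OpenRefine cannot reach the service, it can help to query it directly. The sketch below builds the kind of request OpenRefine makes: a `queries` parameter holding a JSON batch of queries, sent to the Discogs address from the steps above. It assumes the service accepts GET requests with a `queries` parameter, which is common for reconciliation services but not confirmed here for these two. The function names are hypothetical, and `fetch_candidates` is defined but not called, since it only works while the service is running.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import urlopen

# Address used for the Discogs service in the steps above.
DISCOGS_SERVICE = "http://127.0.0.1:3000/reconcile"


def build_reconcile_url(base_url, titles, limit=3):
    """Build a reconciliation URL carrying one query per title (keys q0, q1, ...)."""
    queries = {f"q{i}": {"query": title, "limit": limit} for i, title in enumerate(titles)}
    return base_url + "?" + urlencode({"queries": json.dumps(queries)})


def fetch_candidates(url, timeout=10):
    """Send the request and decode the JSON reply; requires the service to be running."""
    with urlopen(url, timeout=timeout) as reply:
        return json.load(reply)


url = build_reconcile_url(DISCOGS_SERVICE, ["Tevya And His Daughters"])
print(url)

# Decode the URL again to show exactly what the service would receive.
sent = json.loads(parse_qs(urlsplit(url).query)["queries"][0])
print(sent)
```

With the service running, calling `fetch_candidates(url)` should return the candidates for each query key. Swapping in the port 9000 address would target the Internet Archive service instead.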
What's Next?
That's next week's blog post. Essentially, you will need to verify the matches that you've made with these reconciliation services, then pull the supplemental data, and finally curate the data.
Also, if you want to listen to Tevya And His Daughters, you can play the LP below.
Date
June 4, 2024